# Blog Archives

## How to Plot Graphs on Multiple Scales

Graphing is a vital part of any data analysis project. Graphs visually reveal patterns and relationships between variables. Sometimes, however, the scaling of the data obscures the very patterns the graph is meant to convey.
When units and scales vary greatly, it becomes nearly impossible to see detail in every variable on a single plot. This is when a multi-variable plot needs multiple scales. Let’s look at our options…

### Double Y Plots

Many graphing tools have a Graph type option called Double-Y. This graph type makes it possible for you to select one or more variables associated with the left Y axis and one or more variables to associate with the right Y axis. This is a simple way of creating a compound graph that shows variables with two different scales.
For example, open the STATISTICA data file, Baseball.sta, from the path C:/Program Files/StatSoft/STATISTICA 12/Examples/Datasets. Several of the variables in this example data file have very different scales.
On the Graphs tab in the Common group, click Scatterplot. In the 2D Scatterplots Startup Panel, select the Advanced tab. In the Graph type group box, select Double-Y.
Now, click the Variables button, and in the variable selection dialog box, select RUNS as X, WIN as Y Left, and DP as Y Right. Click the OK button.
Click OK in the 2D Scatterplots Startup Panel to create the plot. The resulting plot shows the two Y variables with separately determined scales.
WIN shows a scale from 0.25 to 0.65. This is the season winning proportion. The variable DP is shown on a scale from 100 to 220 and is the number of double plays in the season. Because of the great difference in the scale of these two variables, a Double-Y plot is the best way to simultaneously show these variables’ relationships with the X factor, RUNS.
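
Outside STATISTICA, the same Double-Y idea can be sketched with a plotting library. The example below is a minimal sketch using matplotlib, which is an assumption and not part of the original walkthrough. The numbers are made-up values in the ranges described above, not the contents of Baseball.sta, and `double_y_plot` is a hypothetical helper name.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is required
import matplotlib.pyplot as plt

# Hypothetical values in the ranges described in the text (NOT Baseball.sta data).
runs = [600, 650, 700, 750, 800]
win = [0.40, 0.45, 0.50, 0.55, 0.60]
dp = [190, 170, 160, 150, 140]


def double_y_plot(x, y_left, y_right):
    """Scatter two variables against x, each on its own Y axis."""
    fig, ax_left = plt.subplots()
    ax_left.scatter(x, y_left, color="tab:blue")
    ax_left.set_xlabel("RUNS")
    ax_left.set_ylabel("WIN")

    # twinx() creates a second Y axis that shares the same X axis.
    ax_right = ax_left.twinx()
    ax_right.scatter(x, y_right, color="tab:red", marker="s")
    ax_right.set_ylabel("DP")
    return fig, ax_left, ax_right


fig, ax_left, ax_right = double_y_plot(runs, win, dp)
```

Each axis keeps its own automatically determined scale, which is the property the Double-Y graph type provides.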

### Multiple Y Plots

An additional option is available for creating plots with multiple axis scales. This option is used when you need more scales than the Double-Y allows or when you need an additional axis in another place or capacity.
Continuing the same example, add a second variable, BA, to the Y Left variable list.
Click OK to create the new plot.
Now, WIN and BA share the left Y axis. BA, batting average, is on a scale of .2 to .3. Giving BA a separate Y axis scale would show more detail in the added variable. To do this, right-click in the graph, and on the shortcut menu select Graph Options. Select the Axis – General tab of the Graph Options dialog box.
From the Axis drop-down menu, select the Y left axis. Then click Add new axis. A new Y left axis is added to the plot called Y left’.
Next, the BA variable needs to be related to that axis and customized. Select the Plot – General tab to make this change.
On the Plot drop-down list, select the variable BA. Then, in the Assignment of axis group, select the Custom option button, and specify Y left’ as the custom axis.
Click OK to update the plot.

The resulting plot now has three Y variables plotted, each with its own Y axis scaling and labeling. Showing patterns and relationships in data of varying scale is made easy with multiple axes.

## Process Analysis

Sampling plans are discussed in detail in Duncan (1974) and Montgomery (1985). Most process capability procedures (and indices) were only recently introduced to the US from Japan (Kane, 1986); however, they are discussed in three excellent recent hands-on books by Bhote (1988), Hart and Hart (1989), and Pyzdek (1989). Detailed discussions of these methods can also be found in Montgomery (1991).

Step-by-step instructions for the computation and interpretation of capability indices are also provided in the Fundamental Statistical Process Control Reference Manual published by the ASQC (American Society for Quality Control) and AIAG (Automotive Industry Action Group, 1991; referenced as ASQC/AIAG, 1991). Repeatability and reproducibility (R & R) methods are discussed in Grant and Leavenworth (1980), Pyzdek (1989) and Montgomery (1991); a more detailed discussion of the subject (of variance estimation) is also provided in Duncan (1974).

Step-by-step instructions on how to conduct and analyze R & R experiments are presented in the Measurement Systems Analysis Reference Manual published by ASQC/AIAG (1990). In the following topics, we will briefly introduce the purpose and logic of each of these procedures. For more information on analyzing designs with random effects and for estimating components of variance, see Variance Components.

## Sampling Plans

### General Purpose

A common question that quality control engineers face is to determine how many items from a batch (e.g., shipment from a supplier) to inspect in order to ensure that the items (products) in that batch are of acceptable quality. For example, suppose we have a supplier of piston rings for small automotive engines that our company produces, and our goal is to establish a sampling procedure (of piston rings from the delivered batches) that ensures a specified quality. In principle, this problem is similar to that of on-line quality control discussed in Quality Control. In fact, you may want to read that section at this point to familiarize yourself with the issues involved in industrial statistical quality control.

Acceptance sampling. The procedures described here are useful whenever we need to decide whether or not a batch or lot of items complies with specifications, without having to inspect 100% of the items in the batch. Because of the nature of the problem – whether to accept a batch – these methods are also sometimes discussed under the heading of acceptance sampling.

Advantages over 100% inspection. An obvious advantage of acceptance sampling over 100% inspection of the batch or lot is that reviewing only a sample requires less time, effort, and money. In some cases, inspection of an item is destructive (e.g., stress testing of steel), and testing 100% would destroy the entire batch. Finally, from a managerial standpoint, rejecting an entire batch or shipment (based on acceptance sampling) from a supplier, rather than just a certain percent of defective items (based on 100% inspection) often provides a stronger incentive to the supplier to adhere to quality standards.

### Computational Approach

In principle, the computational approach to the question of how large a sample to take is straightforward. Elementary Concepts discusses the concept of the sampling distribution. Briefly, if we were to take repeated samples of a particular size from a population of, for example, piston rings and compute their average diameters, then the distribution of those averages (means) would approach the normal distribution with a particular mean and standard deviation (or standard error; in sampling distributions the term standard error is preferred, in order to distinguish the variability of the means from the variability of the items in the population). Fortunately, we do not need to take repeated samples from the population in order to estimate the location (mean) and variability (standard error) of the sampling distribution. If we have a good idea (estimate) of what the variability (standard deviation or sigma) is in the population, then we can infer the sampling distribution of the mean. In principle, this information is sufficient to estimate the sample size that is needed in order to detect a certain change in quality (from target specifications). Without going into the details about the computational procedures involved, let us next review the particular information that the engineer must supply in order to estimate required sample sizes.

### Means for H0 and H1

To formalize the inspection process of, for example, a shipment of piston rings, we can formulate two alternative hypotheses: First, we may hypothesize that the average piston ring diameters comply with specifications. This hypothesis is called the null hypothesis (H0). The second and alternative hypothesis (H1) is that the diameters of the piston rings delivered to us deviate from specifications by more than a certain amount. Note that we may specify these types of hypotheses not just for measurable variables such as diameters of piston rings, but also for attributes. For example, we may hypothesize (H1) that the number of defective parts in the batch exceeds a certain percentage. Intuitively, it should be clear that the larger the difference between H0 and H1, the smaller the sample necessary to detect this difference (see Elementary Concepts).

### Alpha and Beta Error Probabilities

To return to the piston rings example, there are two types of mistakes that we can make when inspecting a batch of piston rings that has just arrived at our plant. First, we may erroneously reject H0, that is, reject the batch because we erroneously conclude that the piston ring diameters deviate from target specifications. The probability of committing this mistake is usually called the alpha error probability. The second mistake that we can make is to erroneously not reject H0 (accept the shipment of piston rings), when, in fact, the mean piston ring diameter deviates from the target specification by a certain amount. The probability of committing this mistake is usually called the beta error probability. Intuitively, the more certain we want to be, that is, the lower we set the alpha and beta error probabilities, the larger the sample will have to be; in fact, in order to be 100% certain, we would have to measure every single piston ring delivered to our company.
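
As a rough illustration of how these inputs combine, the sketch below uses the textbook normal-approximation formula for a two-sided test of a mean with known sigma, n = ((z(1 - alpha/2) + z(1 - beta)) * sigma / delta)^2. The function name and the example values (sigma of 0.01 mm, shift of 0.01 mm) are assumptions for illustration and are not taken from the procedures referenced above.

```python
from math import ceil
from statistics import NormalDist


def sample_size_for_mean(sigma, delta, alpha=0.05, beta=0.10):
    """Approximate n needed to detect a shift `delta` in the mean.

    sigma : assumed population standard deviation
    delta : difference between the means under H0 and H1
    alpha : probability of erroneously rejecting H0
    beta  : probability of erroneously not rejecting H0
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # two-sided critical value
    z_beta = z.inv_cdf(1 - beta)
    return ceil(((z_alpha + z_beta) * sigma / delta) ** 2)


n_large_shift = sample_size_for_mean(sigma=0.01, delta=0.01)
n_small_shift = sample_size_for_mean(sigma=0.01, delta=0.005)
```

Halving the shift to be detected roughly quadruples the required sample, and lowering alpha or beta increases it further, which matches the intuition described above.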

### Fixed Sampling Plans

To construct a simple sampling plan, we would first decide on a sample size, based on the means under H0/H1 and the particular alpha and beta error probabilities. Then, we would take a single sample of this fixed size and, based on the mean in this sample, decide whether to accept or reject the batch. This procedure is referred to as a fixed sampling plan.

Operating characteristic (OC) curve. The power of the fixed sampling plan can be summarized via the operating characteristic curve. In that plot, the probability of rejecting H0 (and accepting H1) is plotted on the Y axis, as a function of an actual shift from the target (nominal) specification to the respective values shown on the X axis of the plot (see example below). This probability is, of course, one minus the beta error probability of erroneously rejecting H1 and accepting H0; this value is referred to as the power of the fixed sampling plan to detect deviations. Also indicated in this plot are the power functions for smaller sample sizes.
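
The power function behind such a plot can be sketched for the simple case of a two-sided z test with known sigma. This is an assumption made for illustration; the text does not state which test the plot is based on, and `power` and the example values are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def power(shift, sigma, n, alpha=0.05):
    """Probability of rejecting H0 when the true mean is off target by `shift`."""
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)
    ncp = shift * sqrt(n) / sigma  # shift expressed in standard-error units
    return z.cdf(-z_crit + ncp) + z.cdf(-z_crit - ncp)


# Power curves for several sample sizes over a grid of shifts (hypothetical values, in mm).
shifts = [0.0, 0.005, 0.01, 0.015, 0.02]
curves = {n: [power(s, sigma=0.01, n=n) for s in shifts] for n in (5, 11, 20)}
```

At a shift of zero the power equals alpha, and for any nonzero shift the curve for a larger sample lies above the curve for a smaller one.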

### Sequential Sampling Plans

As an alternative to the fixed sampling plan, we could randomly choose individual piston rings and record their deviations from specification. As we continue to measure each piston ring, we could keep a running total of the sum of deviations from specification. Intuitively, if H1 is true, that is, if the average piston ring diameter in the batch is not on target, then we would expect to observe a slowly increasing or decreasing cumulative sum of deviations, depending on whether the average diameter in the batch is larger or smaller than the specification, respectively. It turns out that this kind of sequential sampling of individual items from the batch is a more sensitive procedure than taking a fixed sample. In practice, we continue sampling until we either accept or reject the batch.

Using a sequential sampling plan. Typically, we would produce a graph in which the cumulative deviations from specification (plotted on the Y-axis) are shown for successively sampled items (e.g., piston rings, plotted on the X-axis). Two sets of lines are then drawn in this graph to denote the “corridor” within which we continue to draw samples; that is, as long as the cumulative sum of deviations from specification stays within this corridor, we continue sampling.

If the cumulative sum of deviations steps outside the corridor, we stop sampling. If the cumulative sum moves above the upper line or below the lower line, we reject the batch. If the cumulative sum steps out of the corridor to the inside, that is, if it moves closer to the center line, we accept the batch (since this indicates zero deviation from specification). Note that the inside area starts only at a certain sample number; this indicates the minimum number of samples necessary to accept the batch (with the current error probability).
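
The stopping logic can be sketched with Wald's sequential probability ratio test for a normal mean with known sigma. Note that this is a simplified, one-sided stand-in (it tests an upward shift of size `delta` only) and not the two-sided corridor described above. The function name and the example values are assumptions.

```python
from math import log


def sprt_normal_mean(deviations, delta, sigma, alpha=0.05, beta=0.10):
    """One-sided Wald SPRT on deviations from the target specification.

    H0: mean deviation = 0, H1: mean deviation = delta (delta > 0).
    Returns (decision, number_of_items_inspected).
    """
    lower = log(beta / (1 - alpha))   # accept the batch at or below this bound
    upper = log((1 - beta) / alpha)   # reject the batch at or above this bound
    llr = 0.0                         # cumulative log-likelihood ratio
    for n, d in enumerate(deviations, start=1):
        llr += (delta / sigma ** 2) * (d - delta / 2)
        if llr >= upper:
            return "reject", n
        if llr <= lower:
            return "accept", n
    return "continue", len(deviations)


on_target = sprt_normal_mean([0.0] * 20, delta=0.01, sigma=0.01)
shifted = sprt_normal_mean([0.01] * 20, delta=0.01, sigma=0.01)
```

With these idealized inputs the plan stops after a handful of items in either direction. The earliest possible acceptance corresponds to the minimum sample number mentioned above.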

### Summary

To summarize, the idea of (acceptance) sampling is to use statistical “inference” to accept or reject an entire batch of items, based on the inspection of only relatively few items from that batch. The advantage of applying statistical reasoning to this decision is that we can be explicit about the probabilities of making a wrong decision.

Whenever possible, sequential sampling plans are preferable to fixed sampling plans because they are more powerful. In most cases, relative to the fixed sampling plan, using sequential plans requires fewer items to be inspected in order to arrive at a decision with the same degree of certainty.

## Process (Machine) Capability Analysis

### Introductory Overview

Quality Control describes numerous methods for monitoring the quality of a production process. However, once a process is under control the question arises, “to what extent does the long-term performance of the process comply with engineering requirements or managerial goals?” For example, to return to our piston ring example, how many of the piston rings that we are using fall within the design specification limits? In more general terms, the question is, “how capable is our process (or supplier) in terms of producing items within the specification limits?” Most of the procedures and indices described here were only recently introduced to the US by Ford Motor Company (Kane, 1986). They allow us to summarize the process capability in terms of meaningful percentages and indices.

In this topic, the computation and interpretation of process capability indices will first be discussed for the normal distribution case. If the distribution of the quality characteristic of interest does not follow the normal distribution, modified capability indices can be computed based on the percentiles of a fitted non-normal distribution.

Order of business. Note that it makes little sense to examine the process capability if the process is not in control. If the means of successively taken samples fluctuate widely, or are clearly off the target specification, then those quality problems should be addressed first. Therefore, the first step towards a high-quality process is to bring the process under control, using the charting techniques available in Quality Control.

### Computational Approach

Once a process is in control, we can ask the question concerning the process capability. Again, the approach to answering this question is based on “statistical” reasoning, and is actually quite similar to that presented earlier in the context of sampling plans. To return to the piston ring example, given a sample of a particular size, we can estimate the standard deviation of the process, that is, the resultant ring diameters. We can then draw a histogram of the distribution of the piston ring diameters. As we discussed earlier, if the distribution of the diameters is normal, then we can make inferences concerning the proportion of piston rings within specification limits.

(For non-normal distributions, see Percentile Method.) Let us now review some of the major indices that are commonly used to describe process capability.

### Capability Analysis – Process Capability Indices

Process range. First, it is customary to establish the ± 3 sigma limits around the nominal specifications. Actually, the sigma limits should be the same as the ones used to bring the process under control using Shewhart control charts (see Quality Control). These limits denote the range of the process (i.e., process range). If we use the ± 3 sigma limits then, based on the normal distribution, we can estimate that approximately 99.7% of all piston rings fall within these limits.

Specification limits LSL, USL. Usually, engineering requirements dictate a range of acceptable values. In our example, it may have been determined that acceptable values for the piston ring diameters would be 74.0 ± .02 millimeters. Thus, the lower specification limit (LSL) for our process is 74.0 – 0.02 = 73.98; the upper specification limit (USL) is 74.0 + 0.02 = 74.02. The difference between USL and LSL is called the specification range.

Potential capability (Cp). This is the simplest and most straightforward indicator of process capability. It is defined as the ratio of the specification range to the process range; using ± 3 sigma limits we can express this index as:

Cp = (USL-LSL)/(6*Sigma)

Put into words, this ratio expresses the proportion of the range of the normal curve that falls within the engineering specification limits (provided that the mean is on target, that is, that the process is centered, see below).
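
A minimal sketch of this computation follows, using the specification limits from the piston ring example. The sigma value and the function names are assumptions. The second function shows what a given Cp implies for a centered, normally distributed process, where the specification limits sit at ± 3*Cp sigma.

```python
from statistics import NormalDist


def cp(usl, lsl, sigma):
    """Potential capability: specification range over the 6-sigma process range."""
    return (usl - lsl) / (6 * sigma)


def fraction_out_of_spec(cp_value):
    """Expected fraction outside the limits for a centered normal process."""
    # Both tails beyond +/- 3*Cp sigma; the lower-tail form avoids round-off.
    return 2 * NormalDist().cdf(-3 * cp_value)


cp_example = cp(usl=74.02, lsl=73.98, sigma=0.005)  # sigma is a hypothetical value
out_at_cp_1 = fraction_out_of_spec(1.0)
out_at_cp_067 = fraction_out_of_spec(0.67)
```

With these inputs Cp is about 1.33. A centered process with Cp = 1 leaves roughly 0.27% of items outside the limits, and one with Cp = .67 leaves roughly 4.4%.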

Bhote (1988) reports that prior to the widespread use of statistical quality control techniques (prior to 1980), the normal quality of US manufacturing processes was approximately Cp = .67. At this level the specification limits lie at only about ± 2 sigma, so for a centered, normally distributed process roughly 4.5% of items (a little over 2% in each tail of the normal curve) fall outside specification limits. As of 1988, only about 30% of US processes are at or below this level of quality (see Bhote, 1988, p. 51). Ideally, of course, we would like this index to be greater than 1, that is, we would like to achieve a process capability so that no (or almost no) items fall outside specification limits. Interestingly, in the early 1980’s the Japanese manufacturing industry adopted as their standard Cp = 1.33! The process capability required to manufacture high-tech products is usually even higher than this; Minolta has established a Cp index of 2.0 as their minimum standard (Bhote, 1988, p. 53), and as the standard for its suppliers. Note that high process capability usually implies lower, not higher costs, taking into account the costs due to poor quality. We will return to this point shortly.

Capability ratio (Cr). This index is equivalent to Cp; specifically, it is computed as 1/Cp (the inverse of Cp).

Lower/upper potential capability: Cpl, Cpu. A major shortcoming of the Cp (and Cr) index is that it may yield erroneous information if the process is not on target, that is, if it is not centered. We can express non-centering via the following quantities. First, upper and lower potential capability indices can be computed to reflect the deviation of the observed process mean from the LSL and USL. Assuming ± 3 sigma limits as the process range, we compute:

Cpl = (Mean – LSL)/(3*Sigma)
and
Cpu = (USL – Mean)/(3*Sigma)

Obviously, if these values are not identical to each other, then the process is not centered.

Non-centering correction (K). We can correct Cp for the effects of non-centering. Specifically, we can compute:

K=abs(D – Mean)/(1/2*(USL – LSL))

where

D = (USL+LSL)/2.

This correction factor expresses the non-centering (target specification minus mean) relative to half the specification range.

Demonstrated excellence (Cpk). Finally, we can adjust Cp for the effect of non-centering by computing:

Cpk = (1-K)*Cp

If the process is perfectly centered, then K is equal to zero, and Cpk is equal to Cp. However, as the process drifts from the target specification, K increases and Cpk becomes smaller than Cp.
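
The centering-related indices can be computed together. The sketch below follows the formulas given above. The function name and the process mean and sigma are hypothetical values chosen for illustration.

```python
def capability_indices(mean, sigma, lsl, usl):
    """Cp, Cpl, Cpu, K and Cpk, assuming +/- 3 sigma limits as the process range."""
    cp = (usl - lsl) / (6 * sigma)
    cpl = (mean - lsl) / (3 * sigma)
    cpu = (usl - mean) / (3 * sigma)
    d = (usl + lsl) / 2                      # midpoint of the specification range
    k = abs(d - mean) / ((usl - lsl) / 2)    # non-centering correction
    cpk = (1 - k) * cp
    return {"Cp": cp, "Cpl": cpl, "Cpu": cpu, "K": k, "Cpk": cpk}


# Piston ring limits from the text; mean and sigma are assumed values.
idx = capability_indices(mean=74.005, sigma=0.005, lsl=73.98, usl=74.02)
```

Here the mean sits 0.005 mm above the midpoint, so K is 0.25 and Cpk drops from about 1.33 to 1.0. As long as the mean lies between the specification limits, Cpk computed this way equals the smaller of Cpl and Cpu.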

Potential Capability II: Cpm. A recent modification (Chan, Cheng, & Spiring, 1988) to Cp is directed at adjusting the estimate of sigma for the effect of (random) non-centering. Specifically, we may compute the alternative sigma (Sigma2) as:

$$\sigma_2 = \left\{ \frac{\sum_{i=1}^{n} (x_i - TS)^2}{n-1} \right\}^{1/2}$$

where:
$\sigma_2$ (Sigma2) is the alternative estimate of sigma
$x_i$ is the value of the i'th observation in the sample
$TS$ is the target or nominal specification
$n$ is the number of observations in the sample

We may then use this alternative estimate of sigma to compute Cp as before; however, we will refer to the resultant index as Cpm.
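
A short sketch of Cpm follows, using the alternative sigma computed around the target rather than around the sample mean. The function name and the measurements are hypothetical.

```python
from math import sqrt


def cpm(values, target, lsl, usl):
    """Cp computed with sigma estimated around the target specification."""
    n = len(values)
    sigma2 = sqrt(sum((x - target) ** 2 for x in values) / (n - 1))
    return (usl - lsl) / (6 * sigma2)


# Hypothetical diameters (mm), first centered on the 74.0 target, then shifted by 0.01.
centered = [74.01, 73.99, 74.00, 74.01, 73.99]
shifted = [x + 0.01 for x in centered]

cpm_centered = cpm(centered, target=74.0, lsl=73.98, usl=74.02)
cpm_shifted = cpm(shifted, target=74.0, lsl=73.98, usl=74.02)
```

The shifted sample has the same spread around its own mean, yet its Cpm is lower because deviations are measured from the target.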

### Process Performance vs. Process Capability

When monitoring a process via a quality control chart (e.g., the X-bar and R-chart; Quality Control) it is often useful to compute the capability indices for the process. Specifically, when the data set consists of multiple samples, such as data collected for the quality control chart, then one can compute two different indices of variability in the data. One is the regular standard deviation for all observations, ignoring the fact that the data consist of multiple samples; the other is to estimate the process’s inherent variation from the within-sample variability. For example, when plotting X-bar and R-charts one may use the common estimator R-bar/d2 for the process sigma (e.g., see Duncan, 1974; Montgomery, 1985, 1991). Note, however, that this estimator is only valid if the process is statistically stable. For a detailed discussion of the difference between the total process variation and the inherent variation, refer to the ASQC/AIAG reference manual (ASQC/AIAG, 1991, page 80).

When the total process variability is used in the standard capability computations, the resulting indices are usually referred to as process performance indices (as they describe the actual performance of the process), while indices computed from the inherent variation (within-sample sigma) are referred to as capability indices (since they describe the inherent capability of the process).
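
The distinction can be sketched by computing both sigma estimates from the same subgrouped data. The subgroup values are hypothetical, and the d2 values are the standard control chart constants for subgroup sizes 2 to 5. The name Pp for the performance index is a common convention and is not used in the text above.

```python
from statistics import mean, stdev

# Standard d2 constants for subgroup sizes 2..5.
D2 = {2: 1.128, 3: 1.693, 4: 2.059, 5: 2.326}


def within_sigma(subgroups):
    """Inherent variation: average subgroup range divided by d2."""
    n = len(subgroups[0])
    r_bar = mean(max(s) - min(s) for s in subgroups)
    return r_bar / D2[n]


def overall_sigma(subgroups):
    """Total variation: standard deviation of all observations pooled."""
    return stdev(x for s in subgroups for x in s)


# Hypothetical subgroups of 3 diameters (mm); the subgroup means drift slightly.
subgroups = [[73.99, 74.00, 74.01], [74.00, 74.01, 74.02], [73.98, 73.99, 74.00]]
sigma_within = within_sigma(subgroups)
sigma_overall = overall_sigma(subgroups)

spec_range = 74.02 - 73.98
capability = spec_range / (6 * sigma_within)    # Cp, from inherent variation
performance = spec_range / (6 * sigma_overall)  # Pp, from total variation
```

Because the subgroup means drift, the total sigma exceeds the within-sample sigma, so the performance index comes out below the capability index.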

### Using Experiments to Improve Process Capability

As mentioned before, the higher the Cp index, the better the process – and there is virtually no upper limit to this relationship. The issue of quality costs, that is, the losses due to poor quality, is discussed in detail in the context of Taguchi robust design methods (see Experimental Design). In general, higher quality usually results in lower costs overall; even though the costs of production may increase, the losses due to poor quality, for example, due to customer complaints, loss of market share, etc. are usually much greater. In practice, two or three well-designed experiments carried out over a few weeks can often achieve a Cp of 5 or higher. If you are not familiar with the use of designed experiments, but are concerned with the quality of a process, we strongly recommend that you review the methods detailed in Experimental Design.

### Testing the Normality Assumption

The indices we have just reviewed are only meaningful if, in fact, the quality characteristic that is being measured is normally distributed. Specific tests of the normality assumption (the Kolmogorov-Smirnov and Chi-square goodness-of-fit tests) are available; these tests are described in most statistics textbooks, and they are also discussed in greater detail in Nonparametrics and Distribution Fitting.

A visual check for normality is to examine the probability-probability and quantile-quantile plots for the normal distribution. For more information, see Process Analysis and Non-Normal Distributions.

### Tolerance Limits

Before the introduction of process capability indices in the early 1980’s, the common method for estimating the characteristics of a production process was to estimate and examine the tolerance limits of the process (see, for example, Hald, 1952). The logic of this procedure is as follows. Let us assume that the respective quality characteristic is normally distributed in the population of items produced; we can then estimate the lower and upper interval limits that will ensure with a certain level of confidence (probability) that a certain percent of the population is included in those limits. Put another way, given:

1. a specific sample size (n),
2. the process mean,
3. the process standard deviation (sigma),
4. a confidence level, and
5. the percent of the population that we want to be included in the interval,

we can compute the corresponding tolerance limits that will satisfy all these parameters. You can also compute parameter-free tolerance limits that are not based on the assumption of normality (Scheffe & Tukey, 1944, p. 217; Wilks, 1946, p. 93; see also Duncan, 1974, or Montgomery, 1985, 1991).
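
For the parameter-free case, one classical result applies when the smallest and largest observations in the sample are used as the tolerance limits. The confidence that they enclose at least a proportion p of the population is 1 - n*p^(n-1) + (n-1)*p^n. The sketch below uses this formula to find the smallest adequate sample size. The function names are assumptions, and the choice of the sample extremes as limits is only one of several possible constructions.

```python
def coverage_confidence(n, p):
    """Confidence that [min, max] of n observations covers at least proportion p."""
    return 1 - n * p ** (n - 1) + (n - 1) * p ** n


def min_sample_size(p, confidence):
    """Smallest n whose sample extremes give the requested confidence."""
    n = 2
    while coverage_confidence(n, p) < confidence:
        n += 1
    return n


n_95_95 = min_sample_size(p=0.95, confidence=0.95)
```

Requiring 95% coverage with 95% confidence leads to a sample of 93 items, which illustrates how demanding distribution-free limits are compared with limits based on the normality assumption.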

## Gage Repeatability and Reproducibility

### Introductory Overview

Gage repeatability and reproducibility analysis addresses the issue of precision of measurement. The purpose of repeatability and reproducibility experiments is to determine the proportion of measurement variability that is due to (1) the items or parts being measured (part-to-part variation), (2) the operator or appraiser of the gages (reproducibility), and (3) errors (unreliabilities) in the measurements over several trials by the same operators of the same parts (repeatability). In the ideal case, all variability in measurements will be due to the part-to-part variation, and only a negligible proportion of the variability will be due to operator reproducibility and trial-to-trial repeatability.

To return to the piston ring example, if we require detection of deviations from target specifications of the magnitude of .01 millimeters, then we obviously need to use gages of sufficient precision. The procedures described here allow the engineer to evaluate the precision of gages and different operators (users) of those gages, relative to the variability of the items in the population.

You can compute the standard indices of repeatability, reproducibility, and part-to-part variation, based either on ranges (as is still common in these types of experiments) or from the analysis of variance (ANOVA) table (as, for example, recommended in ASQC/AIAG, 1990, page 65). The ANOVA table will also contain an F test (statistical significance test) for the operator-by-part interaction, and report the estimated variances, standard deviations, and confidence intervals for the components of the ANOVA model.

Finally, you can compute the respective percentages of total variation, and report so-called percent-of-tolerance statistics. These measures are briefly discussed in the following sections of this introduction. Additional information can be found in Duncan (1974), Montgomery (1991), or the DataMyte Handbook (1992); step-by-step instructions and examples are also presented in the ASQC/AIAG Measurement systems analysis reference manual (1990) and the ASQC/AIAG Fundamental statistical process control reference manual (1991).

Note that there are several other statistical procedures which may be used to analyze these types of designs; see the section on Methods for Analysis of Variance for details. In particular, the methods discussed in the Variance Components and Mixed Model ANOVA/ANCOVA chapter are very efficient for analyzing very large nested designs (e.g., with more than 200 levels overall), or hierarchically nested designs (with or without random factors).

### Computational Approach

One may think of each measurement as consisting of the following components:

1. a component due to the characteristics of the part or item being measured,
2. a component due to the reliability of the gage, and
3. a component due to the characteristics of the operator (user) of the gage.

The method of measurement (measurement system) is reproducible if different users of the gage come up with identical or very similar measurements. A measurement method is repeatable if repeated measurements of the same part produce identical results. Both of these characteristics – repeatability and reproducibility – will affect the precision of the measurement system.

We can design an experiment to estimate the magnitudes of each component, that is, the repeatability, reproducibility, and the variability between parts, and thus assess the precision of the measurement system. In essence, this procedure amounts to an analysis of variance (ANOVA) on an experimental design which includes as factors different parts, operators, and repeated measurements (trials). We can then estimate the corresponding variance components (the term was first used by Daniels, 1939) to assess the repeatability (variance due to differences across trials), reproducibility (variance due to differences across operators), and variability between parts (variance due to differences across parts). If you are not familiar with the general idea of ANOVA, you may want to refer to ANOVA/MANOVA. In fact, the extensive features provided there can also be used to analyze repeatability and reproducibility studies.

### Plots of Repeatability and Reproducibility

There are several ways to summarize via graphs the findings from a repeatability and reproducibility experiment. For example, suppose we are manufacturing small kilns that are used for drying materials for other industrial production processes. The kilns should operate at a target temperature of around 100 degrees Celsius. In this study, 5 different engineers (operators) measured the same sample of 8 kilns (parts), three times each (three trials). We can plot the mean ratings of the 8 parts by operator. If the measurement system is reproducible, then the pattern of means across parts should be quite consistent across the 5 engineers who participated in the study.

R and S charts. Quality Control discusses in detail the idea of R (range) and S (sigma) plots for controlling process variability. We can apply those ideas here and produce a plot of ranges (or sigmas) by operators or by parts; these plots will allow us to identify outliers among operators or parts. If one operator produced particularly wide ranges of measurements, we may want to find out why that particular person had problems producing reliable measurements (e.g., perhaps he or she failed to understand the instructions for using the measurement gage).

Analogously, producing an R chart by parts may allow us to identify parts that are particularly difficult to measure reliably; again, inspecting that particular part may give us some insights into the weaknesses in our measurement system.

Repeatability and reproducibility summary plot. The summary plot shows the individual measurements by each operator; specifically, the measurements are shown in terms of deviations from the respective average rating for the respective part. Each trial is represented by a point, and the different measurement trials for each operator for each part are connected by a vertical line. Boxes drawn around the measurements give us a general idea of a particular operator’s bias (see graph below).

### Components of Variance

Percent of Process Variation and Tolerance. The Percent Tolerance allows you to evaluate the performance of the measurement system with regard to the overall process variation, and the respective tolerance range. You can specify the tolerance range (Total tolerance for parts) and the Number of sigma intervals. The latter value is used in the computations to define the range (spread) of the respective (repeatability, reproducibility, part-to-part, etc.) variability. Specifically, the default value (5.15) defines 5.15 times the respective sigma estimate as the respective range of values; if the data are normally distributed, then this range defines 99% of the space under the normal curve, that is, the range that will include 99% of all values (or reproducibility/repeatability errors) due to the respective source of variation.
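
The arithmetic behind these percentages can be sketched as follows. The function names and the sigma and tolerance values are assumptions. The first function simply checks how much of a normal distribution a range of k sigma covers.

```python
from statistics import NormalDist


def normal_coverage(k):
    """Fraction of a normal distribution within a range of k sigma (+/- k/2 sigma)."""
    return 2 * NormalDist().cdf(k / 2) - 1


def percent_of_tolerance(sigma_component, tolerance, k=5.15):
    """Spread of one variance source (k * sigma) as a percent of the tolerance range."""
    return 100 * k * sigma_component / tolerance


def percent_of_process_variation(sigma_component, sigma_total):
    """Spread of one source relative to the total spread (k cancels out)."""
    return 100 * sigma_component / sigma_total


coverage_515 = normal_coverage(5.15)
# Hypothetical repeatability sigma of 0.002 mm against a total tolerance of 0.04 mm.
pct_tol = percent_of_tolerance(0.002, 0.04)
```

A range of 5.15 sigma covers about 99% of a normal distribution. In this hypothetical case, repeatability alone would consume about a quarter of the tolerance range.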

Percent of process variation. This value reports the variability due to different sources relative to the total variability (range) in the measurements.

Analysis of Variance. Variance component estimates can be computed from ranges, but a more accurate method is based on the ANOVA mean squares (see Duncan, 1974; ASQC/AIAG, 1990).

One may treat the three factors in the R & R experiment (Operator, Parts, Trials) as random factors in a three-way ANOVA model (see also General ANOVA/MANOVA). For details concerning the different models that are typically considered, refer to ASQC/AIAG (1990, pages 92-95), or to Duncan (1974, pages 716-734). Customarily, it is assumed that all interaction effects by the trial factor are non-significant. This assumption seems reasonable, since, for example, it is difficult to imagine how the measurement of some parts will be systematically different in successive trials, in particular when parts and trials are randomized.

However, the Operator by Parts interaction may be important. For example, it is conceivable that certain less experienced operators will be more prone to particular biases, and hence will arrive at systematically different measurements for particular parts. If so, then one would expect a significant two-way interaction (again, refer to General ANOVA/MANOVA if you are not familiar with ANOVA terminology).

When the two-way interaction is statistically significant, one can separately estimate the variance components due to operator variability and due to the operator-by-parts variability.

In the case of significant interactions, the combined repeatability and reproducibility variability is defined as the sum of three components: repeatability (gage error), operator variability, and the operator-by-part variability.

If the Operator by Part interaction is not statistically significant, a simpler additive model without interactions can be used.

### Summary

To summarize, the purpose of the repeatability and reproducibility procedures is to allow the quality control engineer to assess the precision of the measurement system (gages) used in the quality control process. Obviously, if the measurement system is not repeatable (large variability across trials) or reproducible (large variability across operators) relative to the variability between parts, then the measurement system is not sufficiently precise to be used in the quality control efforts. For example, it should not be used in charts produced via Quality Control, or product capability analyses and acceptance sampling procedures via Process Analysis.

## Non-Normal Distributions

### Introductory Overview

General Purpose. The concept of process capability is described in detail in the Process Capability Overview. To reiterate, when judging the quality of a (e.g., production) process it is useful to estimate the proportion of items produced that fall outside a predefined acceptable specification range. For example, the so-called Cp index is computed as:

Cp = (USL-LSL)/(6*sigma)

where sigma is the estimated process standard deviation, and USL and LSL are the upper and lower specification limits, respectively. If the distribution of the respective quality characteristic or variable (e.g., size of piston rings) is normal, and the process is perfectly centered (i.e., the mean is equal to the design center), then this index can be interpreted as the proportion of the range of the standard normal curve (the process width) that falls within the engineering specification limits. If the process is not centered, an adjusted index Cpk is used instead.
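As a small illustration of the two indices, the sketch below computes Cp as defined above and Cpk in its usual form, the smaller of the two one-sided distances from the mean to a specification limit divided by 3*sigma. The Cpk formula is not spelled out in this overview, and the specification limits, mean, and sigma are made-up values.

```python
def cp(usl: float, lsl: float, sigma: float) -> float:
    """Potential capability: specification width over 6 sigma."""
    return (usl - lsl) / (6.0 * sigma)


def cpk(usl: float, lsl: float, mean: float, sigma: float) -> float:
    """Capability adjusted for an off-center process mean."""
    upper = (usl - mean) / (3.0 * sigma)
    lower = (mean - lsl) / (3.0 * sigma)
    return min(upper, lower)


# Hypothetical process: limits 9.0 to 11.0, mean shifted toward the upper limit
print(round(cp(11.0, 9.0, 0.25), 4))
print(round(cpk(11.0, 9.0, 10.2, 0.25), 4))
```

Because the mean is off-center in this example, Cpk is smaller than Cp.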

Non-Normal Distributions. You can fit non-normal distributions to the observed histogram, and compute capability indices based on the respective fitted non-normal distribution (via the percentile method). In addition, instead of computing capability indices by fitting specific distributions, you can compute capability indices based on two different general families of distributions: the Johnson distributions (Johnson, 1965; see also Hahn and Shapiro, 1967) and Pearson distributions (Johnson, Nixon, Amos, and Pearson, 1963; Gruska, Mirkhani, and Lamberson, 1989; Pearson and Hartley, 1972), which allow us to approximate a wide variety of continuous distributions. For all distributions, we can also compute the table of expected frequencies, the expected number of observations beyond specifications, and quantile-quantile and probability-probability plots. The specific method for computing process capability indices from these distributions is described in Clements (1989).

Quantile-quantile plots and probability-probability plots. There are various methods for assessing the quality of respective fit to the observed data. In addition to the table of observed and expected frequencies for different intervals, and the Kolmogorov-Smirnov and Chi-square goodness-of-fit tests, you can compute quantile and probability plots for all distributions. These scatterplots are constructed so that if the observed values follow the respective distribution, then the points will form a straight line in the plot. These plots are described further below.

### Fitting Distributions by Moments

In addition to the specific continuous distributions described above, you can fit general “families” of distributions (the so-called Johnson and Pearson curves) with the goal of matching the first four moments of the observed distribution.

General approach. The shapes of most continuous distributions can be sufficiently summarized in the first four moments. Put another way, if one fits to a histogram of observed data a distribution that has the same mean (first moment), variance (second moment), skewness (third moment) and kurtosis (fourth moment) as the observed data, then one can usually approximate the overall shape of the distribution very well. Once a distribution has been fitted, one can then calculate the expected percentile values under the (standardized) fitted curve, and estimate the proportion of items produced by the process that fall within the specification limits.
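The four quantities that the fitted curve has to match can be computed directly from the data. The sketch below assumes the simple population (divide-by-n) estimators and reports kurtosis on the scale where a normal distribution has the value 3; statistical packages may apply small-sample corrections instead.

```python
def first_four_moments(data):
    """Return (mean, variance, skewness, kurtosis) using divide-by-n estimators."""
    n = len(data)
    mean = sum(data) / n
    m2 = sum((x - mean) ** 2 for x in data) / n  # variance
    m3 = sum((x - mean) ** 3 for x in data) / n
    m4 = sum((x - mean) ** 4 for x in data) / n
    skewness = m3 / m2 ** 1.5
    kurtosis = m4 / m2 ** 2  # not excess kurtosis: a normal curve gives 3
    return mean, m2, skewness, kurtosis


print(first_four_moments([1, 2, 3, 4, 5]))
```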

Johnson curves. Johnson (1949) described a system of frequency curves that represents transformations of the standard normal curve (see Hahn and Shapiro, 1967, for details). By applying these transformations to a standard normal variable, a wide variety of non-normal distributions can be approximated, including distributions which are bounded on either one or both sides (e.g., U-shaped distributions). The advantage of this approach is that once a particular Johnson curve has been fit, the normal integral can be used to compute the expected percentage points under the respective curve. Methods for fitting Johnson curves, so as to approximate the first four moments of an empirical distribution, are described in detail in Hahn and Shapiro, 1967, pages 199-220; and Hill, Hill, and Holder, 1976.

Pearson curves. Another system of distributions was proposed by Karl Pearson (e.g., see Hahn and Shapiro, 1967, pages 220-224). The system consists of seven solutions (of 12 originally enumerated by Pearson) to a differential equation which also approximate a wide range of distributions of different shapes. Gruska, Mirkhani, and Lamberson (1989) describe in detail how the different Pearson curves can be fit to an empirical distribution. A method for computing specific Pearson percentiles is also described in Davis and Stephens (1983).

### Assessing the Fit: Quantile and Probability Plots

For each distribution, you can compute the table of expected and observed frequencies and the respective Chi-square goodness-of-fit test, as well as the Kolmogorov-Smirnov d test. However, the best way to assess the quality of the fit of a theoretical distribution to an observed distribution is to review the plot of the observed distribution against the theoretical fitted distribution. There are two standard types of plots used for this purpose: Quantile-quantile plots and probability-probability plots.

Quantile-quantile plots. In quantile-quantile plots (or Q-Q plots for short), the observed values of a variable are plotted against the theoretical quantiles. To produce a Q-Q plot, you first sort the n observed data points into ascending order, so that:

x1 ≤ x2 ≤ ... ≤ xn

These observed values are plotted against one axis of the graph; on the other axis the plot will show:

F^-1((i - radj)/(n + nadj))

where i is the rank of the respective observation, radj and nadj are adjustment factors (≤ 0.5), and F^-1 denotes the inverse of the probability integral for the respective standardized distribution. The resulting plot (see example below) is a scatterplot of the observed values against the (standardized) expected values, given the respective distribution. Note that, in addition to the inverse probability integral value, you can also show the respective cumulative probability values on the opposite axis; that is, the plot will show not only the standardized values for the theoretical distribution, but also the respective p-values.

A good fit of the theoretical distribution to the observed values would be indicated by this plot if the plotted values fall onto a straight line. Note that the adjustment factors radj and nadj ensure that the p-value for the inverse probability integral will fall between 0 and 1, but not including 0 and 1 (see Chambers, Cleveland, Kleiner, and Tukey, 1983).
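A minimal sketch of how the Q-Q plot coordinates could be computed for a normal reference distribution follows. The adjustment values `r_adj=0.5` and `n_adj=0.0` are assumptions chosen for illustration; other choices no larger than 0.5 are also in use.

```python
from statistics import NormalDist


def qq_points(data, r_adj=0.5, n_adj=0.0):
    """Pair each sorted observation with its standardized normal quantile."""
    xs = sorted(data)
    n = len(xs)
    standard_normal = NormalDist()  # mean 0, standard deviation 1
    points = []
    for i, x in enumerate(xs, start=1):
        p = (i - r_adj) / (n + n_adj)  # strictly between 0 and 1
        points.append((standard_normal.inv_cdf(p), x))
    return points


for theoretical, observed in qq_points([4.1, 3.7, 5.2, 4.8, 4.4]):
    print(round(theoretical, 3), observed)
```

If the points returned here fall close to a straight line when plotted, the normal distribution is a plausible fit.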

Probability-probability plots. In probability-probability plots (or P-P plots for short) the observed cumulative distribution function is plotted against the theoretical cumulative distribution function. As in the Q-Q plot, the values of the respective variable are first sorted into ascending order. The i’th observation is plotted against one axis as i/n (i.e., the observed cumulative distribution function), and against the other axis as F(x(i)), where F(x(i)) stands for the value of the theoretical cumulative distribution function for the respective observation x(i). If the theoretical cumulative distribution approximates the observed distribution well, then all points in this plot should fall onto the diagonal line (as in the graph below).
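The P-P coordinates can be sketched in the same way. The example assumes a normal distribution whose mean and standard deviation are estimated from the sample; the function name `pp_points` is hypothetical.

```python
from statistics import NormalDist


def pp_points(data, dist):
    """Pair the observed CDF value i/n with the theoretical CDF value F(x(i))."""
    xs = sorted(data)
    n = len(xs)
    return [(i / n, dist.cdf(x)) for i, x in enumerate(xs, start=1)]


sample = [4.1, 3.7, 5.2, 4.8, 4.4]
fitted = NormalDist.from_samples(sample)  # normal curve fitted to the sample
for observed_p, theoretical_p in pp_points(sample, fitted):
    print(round(observed_p, 3), round(theoretical_p, 3))
```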

### Non-Normal Process Capability Indices (Percentile Method)

As described earlier, process capability indices are generally computed to evaluate the quality of a process, that is, to estimate the relative range of the items manufactured by the process (process width) with regard to the engineering specifications. For the standard, normal-distribution-based, process capability indices, the process width is typically defined as 6 times sigma, that is, as plus/minus 3 times the estimated process standard deviation. For the standard normal curve, these limits (zl = -3 and zu = +3) translate into the 0.135 percentile and 99.865 percentile, respectively. In the non-normal case, the 3 times sigma limits as well as the mean (zM = 0.0) can be replaced by the corresponding standard values, given the same percentiles, under the non-normal curve. This procedure is described in detail by Clements (1989).

Process capability indices. Shown below are the formulas for the non-normal process capability indices:

Cp = (USL-LSL)/(Up-Lp)

CpL = (M-LSL)/(M-Lp)

CpU = (USL-M)/(Up-M)

Cpk = Min(CpU, CpL)

In these equations, M represents the 50th percentile value for the respective fitted distribution, and Up and Lp are the 99.865 and 0.135 percentile values, respectively, if the computations are based on a process width of ±3 times sigma. Note that the values for Up and Lp may be different if the process width is defined by different sigma limits (e.g., ±2 times sigma).
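The formulas above need only the quantile function of the fitted distribution. The sketch below takes that function as an argument; `percentile_capability` is a hypothetical helper, and a normal distribution is used purely as a sanity check, since in that case the results should be close to the conventional Cp and Cpk.

```python
from statistics import NormalDist


def percentile_capability(usl, lsl, quantile, lower_p=0.00135, upper_p=0.99865):
    """Percentile-method capability indices from a fitted quantile function."""
    lp = quantile(lower_p)   # 0.135 percentile
    m = quantile(0.5)        # median
    up = quantile(upper_p)   # 99.865 percentile
    cp = (usl - lsl) / (up - lp)
    cpl = (m - lsl) / (m - lp)
    cpu = (usl - m) / (up - m)
    return {"Cp": cp, "CpL": cpl, "CpU": cpu, "Cpk": min(cpu, cpl)}


# Hypothetical fitted distribution and specification limits
fitted = NormalDist(mu=10.2, sigma=0.25)
print(percentile_capability(11.0, 9.0, fitted.inv_cdf))
```

Any other fitted distribution can be substituted by passing its own quantile (inverse CDF) function.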

## Weibull and Reliability/Failure Time Analysis

A key aspect of product quality is product reliability. A number of specialized techniques have been developed to quantify reliability and to estimate the “life expectancy” of a product. Standard references and textbooks describing these techniques include Lawless (1982), Nelson (1990), Lee (1980, 1992), and Dodson (1994); the relevant functions of the Weibull distribution (hazard, CDF, reliability) are also described in the Weibull CDF, reliability, and hazard functions section. Note that very similar statistical procedures are used in the analysis of survival data (see also the description of Survival Analysis), and, for example, the descriptions in Lee’s book (Lee, 1992) are primarily addressed to biomedical research applications. An excellent overview with many examples of engineering applications is provided by Dodson (1994).

### General Purpose

The reliability of a product or component constitutes an important aspect of product quality. Of particular interest is the quantification of a product’s reliability, so that one can derive estimates of the product’s expected useful life. For example, suppose you are flying a small single engine aircraft. It would be very useful (in fact vital) information to know what the probability of engine failure is at different stages of the engine’s “life” (e.g., after 500 hours of operation, 1000 hours of operation, etc.). Given a good estimate of the engine’s reliability, and the confidence limits of this estimate, one can then make a rational decision about when to swap or overhaul the engine.

### The Weibull Distribution

A useful general distribution for describing failure time data is the Weibull distribution (see also Weibull CDF, reliability, and hazard functions). The distribution is named after the Swedish professor Waloddi Weibull, who demonstrated the appropriateness of this distribution for modeling a wide variety of different data sets (see also Hahn and Shapiro, 1967; for example, the Weibull distribution has been used to model the life times of electronic components, relays, ball bearings, or even some businesses).

Hazard function and the bathtub curve. It is often meaningful to consider the function that describes the probability of failure during a very small time increment (assuming that no failures have occurred prior to that time). This function is called the hazard function (or, sometimes, also the conditional failure, intensity, or force of mortality function), and is generally defined as:

h(t) = f(t)/(1-F(t))

where h(t) stands for the hazard function (of time t), and f(t) and F(t) are the probability density and cumulative distribution functions, respectively. The hazard (conditional failure) function for most machines (components, devices) can best be described in terms of the “bathtub” curve: Very early during the life of a machine, the rate of failure is relatively high (so-called Infant Mortality Failures); after all components settle, and the electronic parts are burned in, the failure rate is relatively constant and low. Then, after some time of operation, the failure rate again begins to increase (so-called Wear-out Failures), until all components or devices will have failed.

For example, new automobiles often suffer several small failures right after they are purchased. Once these have been “ironed out,” a (hopefully) long, relatively trouble-free period of operation will follow. Then, as the car reaches a particular age, it becomes more prone to breakdowns, until finally, after 20 years and 250,000 miles, practically all cars will have failed. A typical bathtub hazard function is shown below.

The Weibull distribution is flexible enough for modeling the key stages of this typical bathtub-shaped hazard function. Shown below are the hazard functions for shape parameters c=.5, c=1, c=2, and c=5.

Clearly, the early (“infant mortality”) “phase” of the bathtub can be approximated by a Weibull hazard function with shape parameter c<1; the constant hazard phase of the bathtub can be modeled with a shape parameter c=1, and the final (“wear-out”) stage of the bathtub with c>1.
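The three regimes can be seen by evaluating the Weibull hazard function for the shape parameters mentioned above. This is a minimal sketch with scale b = 1 and location 0 assumed for simplicity.

```python
def weibull_hazard(t, c, b=1.0, theta=0.0):
    """Weibull hazard rate at time t for shape c, scale b, location theta."""
    return (c / b) * ((t - theta) / b) ** (c - 1.0)


for c in (0.5, 1.0, 2.0, 5.0):
    rates = [round(weibull_hazard(t, c), 3) for t in (0.5, 1.0, 1.5)]
    print("c =", c, "hazard at t = 0.5, 1.0, 1.5:", rates)
```

The printed rates fall over time for c = 0.5, stay constant for c = 1, and rise for c = 2 and c = 5.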

Cumulative distribution and reliability functions. Once a Weibull distribution (with a particular set of parameters) has been fit to the data, a number of additional important indices and measures can be estimated. For example, you can compute the cumulative distribution function (commonly denoted as F(t)) for the fitted distribution, along with the standard errors for this function. Thus, you can determine the percentiles of the cumulative survival (and failure) distribution, and, for example, predict the time at which a predetermined percentage of components can be expected to have failed.

The reliability function (commonly denoted as R(t)) is the complement to the cumulative distribution function (i.e., R(t)=1-F(t)); the reliability function is also sometimes referred to as the survivorship or survival function (since it describes the probability of not failing or of surviving until a certain time t; e.g., see Lee, 1992). Shown below is the reliability function for the Weibull distribution, for different shape parameters.

For shape parameters less than 1, the reliability decreases sharply very early in the respective product’s life, and then slowly thereafter. For shape parameters greater than 1, the initial drop in reliability is small, and then the reliability drops relatively sharply at some point later in time. The point where all curves intersect is called the characteristic life: regardless of the shape parameter, 63.2 percent of the population will have failed at or before this point (i.e., R(t) = 1-0.632 = 0.368). This point in time is also equal to the respective scale parameter b of the two-parameter Weibull distribution (with θ = 0; otherwise it is equal to b+θ).
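The characteristic-life property follows directly from the reliability function: at t = b + θ the exponent equals -1 whatever the shape parameter. A short check, with illustrative parameter values:

```python
import math


def weibull_reliability(t, c, b, theta=0.0):
    """Probability of surviving beyond time t."""
    return math.exp(-(((t - theta) / b) ** c))


b, theta = 100.0, 10.0  # hypothetical scale and location
for c in (0.5, 1.0, 2.0, 5.0):
    print(c, round(weibull_reliability(b + theta, c, b, theta), 3))
```

Every line prints the same reliability, exp(-1), which is approximately 0.368.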

The formulas for the Weibull cumulative distribution, reliability, and hazard functions are shown in the Weibull CDF, reliability, and hazard functions section.

### Censored Observations

In most studies of product reliability, not all items in the study will fail. In other words, by the end of the study the researcher only knows that a certain number of items have not failed for a particular amount of time, but has no knowledge of the exact failure times (i.e., “when the items would have failed”). Those types of data are called censored observations. The issue of censoring, and several methods for analyzing censored data sets, are also described in great detail in the context of Survival Analysis. Censoring can occur in many different ways.

Type I and II censoring. So-called Type I censoring describes the situation when a test is terminated at a particular point in time, so that the remaining items are only known not to have failed up to that time (e.g., we start with 100 light bulbs, and terminate the experiment after a certain amount of time). In this case, the censoring time is often fixed, and the number of items failing is a random variable. In Type II censoring the experiment would be continued until a fixed proportion of items have failed (e.g., we stop the experiment after exactly 50 light bulbs have failed). In this case, the number of items failing is fixed, and time is the random variable.

Left and right censoring. An additional distinction can be made to reflect the “side” of the time dimension at which censoring occurs. In the examples described above, the censoring always occurred on the right side (right censoring), because the researcher knows when exactly the experiment started, and the censoring always occurs on the right side of the time continuum. Alternatively, it is conceivable that the censoring occurs on the left side (left censoring). For example, in biomedical research one may know that a patient entered the hospital at a particular date, and that s/he survived for a certain amount of time thereafter; however, the researcher does not know when exactly the symptoms of the disease first occurred or were diagnosed.

Single and multiple censoring. Finally, there are situations in which censoring can occur at different times (multiple censoring), or only at a particular point in time (single censoring). To return to the light bulb example, if the experiment is terminated at a particular point in time, then a single point of censoring exists, and the data set is said to be single-censored. However, in biomedical research multiple censoring often exists, for example, when patients are discharged from a hospital after different amounts (times) of treatment, and the researcher knows that the patient survived up to those (differential) points of censoring.

The methods described in this section are applicable primarily to right censoring, and single- as well as multiple-censored data.

### Two- and Three-Parameter Weibull Distribution

The Weibull distribution is bounded on the left side. If you look at the probability density function, you can see that the term x-θ must be greater than 0. In most cases, the location parameter θ (theta) is known (usually 0): it identifies the smallest possible failure time. However, sometimes the probability of failure of an item is 0 (zero) for some time after a study begins, and in that case it may be necessary to estimate a location parameter that is greater than 0. There are several methods for estimating the location parameter of the three-parameter Weibull distribution. To identify situations when the location parameter is greater than 0, Dodson (1994) recommends looking for downward- or upward-sloping tails on a probability plot (see below), as well as large (>6) values for the shape parameter after fitting the two-parameter Weibull distribution, which may indicate a non-zero location parameter.

### Parameter Estimation

Maximum likelihood estimation. Standard iterative function minimization methods can be used to compute maximum likelihood parameter estimates for the two- and three-parameter Weibull distribution. The specific methods for estimating the parameters are described in Dodson (1994); a detailed description of a Newton-Raphson iterative method for estimating the maximum likelihood parameters for the two-parameter distribution is provided in Keats and Lawrence (1997).

The estimation of the location parameter for the three-parameter Weibull distribution poses a number of special problems, which are detailed in Lawless (1982). Specifically, when the shape parameter is less than 1, then a maximum likelihood solution does not exist for the parameters. In other instances, the likelihood function may contain more than one maximum (i.e., multiple local maxima). In the latter case, Lawless basically recommends using the smallest failure time (or a value that is a little bit less) as the estimate of the location parameter.
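For the two-parameter case, the maximum likelihood estimate of the shape parameter can be found by solving a single equation in c, after which the scale parameter follows in closed form. The sketch below solves that equation by simple bisection rather than the Newton-Raphson iteration cited above, handles right-censored observations, and uses made-up failure times; the function names and the search interval are assumptions.

```python
import math


def weibull_score(c, times, failed):
    """Profile-likelihood equation for the shape c; it is zero at the ML estimate."""
    t_max = max(times)
    weights = [(t / t_max) ** c for t in times]  # rescaled to avoid overflow
    weighted_log = sum(w * math.log(t) for w, t in zip(weights, times)) / sum(weights)
    fail_logs = [math.log(t) for t, f in zip(times, failed) if f]
    return weighted_log - 1.0 / c - sum(fail_logs) / len(fail_logs)


def fit_weibull_mle(times, failed, lo=1e-3, hi=100.0, tol=1e-10):
    """Return (shape, scale); assumes the shape lies between lo and hi."""
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if weibull_score(mid, times, failed) > 0.0:
            hi = mid  # the score increases with c, so the root is below mid
        else:
            lo = mid
        if hi - lo < tol:
            break
    c = 0.5 * (lo + hi)
    t_max = max(times)
    r = sum(1 for f in failed if f)  # number of observed failures
    b = t_max * (sum((t / t_max) ** c for t in times) / r) ** (1.0 / c)
    return c, b


# Hypothetical data: False marks an item still running when observation stopped
times = [16, 34, 53, 75, 93, 120, 150, 191, 240, 339]
failed = [True, True, True, False, True, True, False, True, True, True]
print(fit_weibull_mle(times, failed))
```

Bisection is slower than Newton-Raphson but needs no derivative and cannot overshoot, which keeps the sketch short.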

Nonparametric (rank-based) probability plots. One can derive a descriptive estimate of the cumulative distribution function (regardless of distribution) by first rank-ordering the observations, and then computing any of the following expressions:

Median rank:

F(t) = (j-0.3)/(n+0.4)

Mean rank:

F(t) = j/(n+1)

White’s plotting position:

F(t) = (j-3/8)/(n+1/4)

where j denotes the failure order (rank; for multiple-censored data a weighted average ordered failure is computed; see Dodson, p. 21), and n is the total number of observations. One can then construct the following plot.

Note that the horizontal Time axis is scaled logarithmically; on the vertical axis the quantity log(log(100/(100-F(t)))) is plotted (a probability scale is shown on the left y-axis). From this plot the parameters of the two-parameter Weibull distribution can be estimated; specifically, the shape parameter is equal to the slope of the linear fit-line, and the scale parameter can be estimated as exp(-intercept/slope).
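The estimation from the probability plot can be sketched as an ordinary least squares fit. The example assumes complete (uncensored) data, median ranks as plotting positions, and F(t) expressed as a proportion, so that the vertical quantity becomes ln(-ln(1 - F(t))). The failure times are made up.

```python
import math


def linear_fit(xs, ys):
    """Ordinary least squares; returns (slope, intercept)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx


def weibull_from_probability_plot(failure_times):
    """Return (shape, scale) estimated from a Weibull probability plot."""
    ts = sorted(failure_times)
    n = len(ts)
    xs = [math.log(t) for t in ts]
    ys = []
    for j in range(1, n + 1):
        median_rank = (j - 0.3) / (n + 0.4)
        ys.append(math.log(-math.log(1.0 - median_rank)))
    slope, intercept = linear_fit(xs, ys)
    return slope, math.exp(-intercept / slope)


print(weibull_from_probability_plot([42, 61, 78, 95, 110, 131, 160, 204]))
```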

Estimating the location parameter from probability plots. It is apparent in the plot shown above that the regression line provides a good fit to the data. When the location parameter is misspecified (e.g., not equal to zero), then the linear fit is worse as compared to the case when it is appropriately specified. Therefore, one can compute the probability plot for several values of the location parameter, and observe the quality of the fit. These computations are summarized in the following plot.

Here the common R-square measure (correlation squared) is used to express the quality of the linear fit in the probability plot, for different values of the location parameter shown on the horizontal x axis (this plot is based on the example data set in Dodson, 1994, Table 2.9). This plot is often very useful when the maximum likelihood estimation procedure for the three-parameter Weibull distribution fails, because it shows whether or not a unique (single) optimum value for the location parameter exists (as in the plot shown above).
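A scan of this kind can be sketched as follows: for each candidate location value, the failure times are shifted, the probability plot coordinates are recomputed, and the squared correlation is recorded. The data here are synthetic, generated to lie exactly on a three-parameter Weibull curve with location 50, and the function names are hypothetical.

```python
import math


def r_squared(xs, ys):
    """Squared Pearson correlation between xs and ys."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy * sxy / (sxx * syy)


def location_scan(failure_times, candidates):
    """Return (theta, R-square) pairs for candidates below the smallest time."""
    ts = sorted(failure_times)
    n = len(ts)
    ys = [math.log(-math.log(1.0 - (j - 0.3) / (n + 0.4))) for j in range(1, n + 1)]
    results = []
    for theta in candidates:
        if theta >= ts[0]:
            continue  # log(t - theta) would be undefined
        xs = [math.log(t - theta) for t in ts]
        results.append((theta, r_squared(xs, ys)))
    return results


# Synthetic complete sample: shape 2, scale 100, location 50
n = 10
data = [50.0 + 100.0 * (-math.log(1.0 - (j - 0.3) / (n + 0.4))) ** 0.5
        for j in range(1, n + 1)]
for theta, r2 in location_scan(data, [0, 10, 20, 30, 40, 50]):
    print(theta, round(r2, 5))
```

A single clear peak in the printed R-square values points to a well-defined location estimate.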

Hazard plotting. Another method for estimating the parameters for the two-parameter Weibull distribution is via hazard plotting (as discussed earlier, the hazard function describes the probability of failure during a very small time increment, assuming that no failures have occurred prior to that time). This method is very similar to the probability plotting method. First plot the cumulative hazard function against the logarithm of the survival times; then fit a linear regression line and compute the slope and intercept of that line. As in probability plotting, the shape parameter can then be estimated as the slope of the regression line, and the scale parameter as exp(-intercept/slope). See Dodson (1994) for additional details; see also Weibull CDF, reliability, and hazard functions.
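A sketch of hazard plotting follows. It assumes that the hazard value assigned to each failure is the reciprocal of its reverse rank, that censored items are skipped but still count toward the ranks, and that the regression is run on the logarithm of the cumulative hazard against the logarithm of time, as on log-log hazard paper. The data are made up.

```python
import math


def cumulative_hazard(times, failed):
    """Return (time, cumulative hazard) at each failure; handles right censoring."""
    # Sort by time; at tied times, failures come before censored items
    order = sorted(zip(times, failed), key=lambda p: (p[0], not p[1]))
    n = len(order)
    total = 0.0
    points = []
    for position, (t, is_failure) in enumerate(order):
        if is_failure:
            total += 1.0 / (n - position)  # reciprocal of the reverse rank
            points.append((t, total))
    return points


def weibull_from_hazard_plot(times, failed):
    """Return (shape, scale) from a straight-line fit of ln H on ln t."""
    pts = cumulative_hazard(times, failed)
    xs = [math.log(t) for t, _ in pts]
    ys = [math.log(h) for _, h in pts]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    intercept = my - slope * mx
    return slope, math.exp(-intercept / slope)


times = [16, 34, 53, 75, 93, 120, 150, 191, 240, 339]
failed = [True, True, True, False, True, True, False, True, True, True]
print(weibull_from_hazard_plot(times, failed))
```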

Method of moments. This method, which approximates the moments of the observed distribution by choosing the appropriate parameters for the Weibull distribution, is also widely described in the literature. In fact, this general method is used for fitting the general non-normal Johnson curves to the data in order to compute non-normal process capability indices (see Fitting Distributions by Moments). However, the method is not suited for censored data sets, and is therefore not very useful for the analysis of failure time data.

Comparing the estimation methods. Dodson (1994) reports the result of a Monte Carlo simulation study, comparing the different methods of estimation. In general, the maximum likelihood estimates proved to be best for large sample sizes (e.g., n>15), while probability plotting and hazard plotting appeared to produce better (more accurate) estimates for smaller samples.

A note of caution regarding maximum likelihood based confidence limits. Many software programs will compute confidence intervals for maximum likelihood estimates, and for the reliability function based on the standard errors of the maximum likelihood estimates. Dodson (1994) cautions against the interpretation of confidence limits computed from maximum likelihood estimates, or more precisely, estimates that involve the information matrix for the estimated parameters. When the shape parameter is less than 2, the variance estimates computed for maximum likelihood estimates lack accuracy, and it is advisable to compute the various results graphs based on nonparametric confidence limits as well.

### Goodness of Fit Indices

A number of different tests have been proposed for evaluating the quality of the fit of the Weibull distribution to the observed data. These tests are discussed and compared in detail in Lawless (1982).

Hollander-Proschan. This test compares the theoretical reliability function to the Kaplan-Meier estimate. The actual computations for this test are somewhat complex, and you may refer to Dodson (1994, Chapter 4) for a detailed description of the computational formulas. The Hollander-Proschan test is applicable to complete, single-censored, and multiple-censored data sets; however, Dodson (1994) cautions that the test may sometimes indicate a poor fit when the data are heavily single-censored. The Hollander-Proschan C statistic can be tested against the normal distribution (z).

Mann-Scheuer-Fertig. This test, proposed by Mann, Scheuer, and Fertig (1973), is described in detail in, for example, Dodson (1994) or Lawless (1982). The null hypothesis for this test is that the population follows the Weibull distribution with the estimated parameters. Nelson (1982) reports this test to have reasonably good power, and this test can be applied to Type II censored data. For computational details refer to Dodson (1994) or Lawless (1982); the critical values for the test statistic have been computed based on Monte Carlo studies, and have been tabulated for n (sample sizes) between 3 and 25.

Anderson-Darling. The Anderson-Darling procedure is a general test to compare the fit of an observed cumulative distribution function to an expected cumulative distribution function. However, this test is only applicable to complete data sets (without censored observations). The critical values for the Anderson-Darling statistic have been tabulated (see, for example, Dodson, 1994, Table 4.4) for sample sizes between 10 and 40; this test is not computed for n less than 10 or greater than 40.

### Interpreting Results

Once a satisfactory fit of the Weibull distribution to the observed failure time data has been obtained, there are a number of different plots and tables that are of interest to understand the reliability of the item under investigation. If a good fit for the Weibull cannot be established, distribution-free reliability estimates (and graphs) should be reviewed to determine the shape of the reliability function.

Reliability plots. This plot will show the estimated reliability function along with the confidence limits.

Note that nonparametric (distribution-free) estimates and their standard errors can also be computed and plotted.

Hazard plots. As mentioned earlier, the hazard function describes the probability of failure during a very small time increment (assuming that no failures have occurred prior to that time). The plot of hazard as a function of time gives valuable information about the conditional failure probability.

Percentiles of the reliability function. Based on the fitted Weibull distribution, one can compute the percentiles of the reliability (survival) function, along with the confidence limits for these estimates (for maximum likelihood parameter estimates). These estimates are particularly valuable for determining the percentages of items that can be expected to have failed at particular points in time.

### Grouped Data

In some cases, failure time data are presented in grouped form. Specifically, instead of having available the precise failure time for each observation, only aggregate information is available about the number of items that failed or were censored in a particular time interval. Such life-table data input is also described in the context of the Survival Analysis chapter. There are two general approaches for fitting the Weibull distribution to grouped data.

First, one can treat the tabulated data as if they were continuous. In other words, one can “expand” the tabulated values into continuous data by assuming (1) that each observation in a given time interval failed exactly at the interval mid-point (interpolating out “half a step” for the last interval), and (2) that censoring occurred after the failures in each interval (in other words, censored observations are sorted after the observed failures). Lawless (1982) advises that this method is usually satisfactory if the class intervals are relatively narrow.

Alternatively, you may treat the data explicitly as a tabulated life table, and use a weighted least squares algorithm (based on Gehan and Siddiqui, 1973; see also Lee, 1992) to fit the Weibull distribution (Lawless, 1982, also describes methods for computing maximum likelihood parameter estimates from grouped data).

### Modified Failure Order for Multiple-Censored Data

For multiple-censored data a weighted average ordered failure is calculated for each failure after the first censored data point. These failure orders are then used to compute the median rank, to estimate the cumulative distribution function.

The increment Ij, which is added to the previous failure order to obtain the modified failure order Oj, is computed as (see Dodson, 1994):

Ij = ((n+1)-Op)/(1+c)

where:

Ij      is the increment for the j’th failure
n      is the total number of data points
Op   is the failure order of the previous observation (and Oj = Op + Ij)
c      is the number of data points remaining in the data set, including the current data point

The median rank is then computed as:

F(t) = (Oj - 0.3)/(n+0.4)

where Oj = Op + Ij denotes the modified failure order, and n is the total number of observations.
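The computation can be sketched as below, applying the increment formula to every failure in time order and using the accumulated order Oj = Op + Ij in the median rank. Before the first censored point the increment works out to exactly 1, so the ordinary ranks are reproduced. The data and the function name are illustrative only.

```python
def modified_median_ranks(times, failed):
    """Return (time, modified failure order, median rank) for each failure."""
    # Sort by time; at tied times, failures come before censored items
    order = sorted(zip(times, failed), key=lambda p: (p[0], not p[1]))
    n = len(order)
    previous_order = 0.0  # Op, the failure order of the previous failure
    rows = []
    for position, (t, is_failure) in enumerate(order):
        if not is_failure:
            continue  # censored items receive no failure order
        remaining = n - position  # c: points left, including the current one
        increment = ((n + 1) - previous_order) / (1 + remaining)  # Ij
        previous_order += increment  # Oj = Op + Ij
        median_rank = (previous_order - 0.3) / (n + 0.4)
        rows.append((t, previous_order, median_rank))
    return rows


# Hypothetical sample: the item at t = 20 was removed before failing
for row in modified_median_ranks([10, 20, 30, 40], [True, False, True, True]):
    print(row)
```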

### Weibull CDF, Reliability, and Hazard

Density function. The Weibull distribution (Weibull, 1939, 1951; see also Lieblein, 1955) has density function (for positive parameters b, c, and θ):

f(x) = (c/b) * [(x-θ)/b]^(c-1) * e^{-[(x-θ)/b]^c}
θ < x,  b > 0,  c > 0

where
b     is the scale parameter of the distribution
c     is the shape parameter of the distribution
θ     is the location parameter of the distribution
e     is the base of the natural logarithm, sometimes called Euler’s e (2.71…)

Cumulative distribution function (CDF). The Weibull distribution has the cumulative distribution function (for positive parameters b, c, and θ):

F(x) = 1 - exp{-[(x-θ)/b]^c}

using the same notation and symbols as described above for the density function.

Reliability function. The Weibull reliability function is the complement of the cumulative distribution function:

R(x) = 1 – F(x)

Hazard function. The hazard function describes the probability of failure during a very small time increment, assuming that no failures have occurred prior to that time. The Weibull distribution has the hazard function (for positive parameters b, c, and ):

h(t) = f(t)/R(t) = [c*(x-)(c-1)] / bc

using the same notation and symbols as described above for the density and reliability functions.

Cumulative hazard function. The Weibull distribution has the cumulative hazard function (for positive parameters b, c, and ):

H(t) = (x-) / bc

using the same notation and symbols as described above for the density and reliability functions.
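The five functions above are closely related, and a short Python sketch can make the relationships concrete. The function names are illustrative, the location parameter is written as `theta` (a naming assumption), and the parameter values in the example are hypothetical.

```python
import math


def _check(x, b, c, theta):
    """Enforce the domain stated for the density: theta < x, b > 0, c > 0."""
    if not (x > theta and b > 0 and c > 0):
        raise ValueError("require theta < x, b > 0, c > 0")


def weibull_pdf(x, b, c, theta=0.0):
    _check(x, b, c, theta)
    z = (x - theta) / b
    return (c / b) * z ** (c - 1) * math.exp(-(z ** c))


def weibull_cdf(x, b, c, theta=0.0):
    _check(x, b, c, theta)
    return 1.0 - math.exp(-(((x - theta) / b) ** c))


def weibull_reliability(x, b, c, theta=0.0):
    # Complement of the CDF.
    return 1.0 - weibull_cdf(x, b, c, theta)


def weibull_hazard(x, b, c, theta=0.0):
    _check(x, b, c, theta)
    return c * (x - theta) ** (c - 1) / b ** c


def weibull_cumulative_hazard(x, b, c, theta=0.0):
    _check(x, b, c, theta)
    return (x - theta) ** c / b ** c


# Hypothetical parameter values, used only to check the identities.
x, b, c, theta = 2.0, 3.0, 1.5, 0.5
hazard_from_ratio = weibull_pdf(x, b, c, theta) / weibull_reliability(x, b, c, theta)
hazard_closed_form = weibull_hazard(x, b, c, theta)
```

Two checks are useful when implementing these: the closed-form hazard should agree with f(x)/R(x), and the cumulative hazard should equal the negative logarithm of the reliability. With shape c = 1 the hazard is constant at 1/b, which is the exponential special case.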

# by third party report

The non-profit Electric Power Research Institute (EPRI) recently conducted a study of the StatSoft technology to determine its suitability for optimizing the performance (heat-rate, emissions, LOI) in an older coal-fired power plant. EPRI ordered from StatSoft an optimization project to be conducted under scrutiny of their inspectors.

Using nine months’ worth of detailed 6-minute interval data describing more than 140 parameters of the process, EPRI found that process data analysis using STATISTICA is a cost-effective solution for optimizing the use of current process hardware to save cost and reduce emissions.

## Overview of the Approach

StatSoft Power Solutions offer solution packages designed for utility companies, for optimizing power plant performance, increasing the efficiency, and reducing emissions. Based on over 20 years of experience in applying advanced data-driven, data mining optimization technologies for process optimization in various industries, these solutions will allow power plants to get the most out of their equipment and control systems, by leveraging all data collected at your site to identify opportunities for improvement, even for older designs such as coal-fired Cyclone furnaces (as well as wall-fired or T-fired designs).

### Opportunities for Data Driven Strategies to Improve Powerplant Performance

Many (most) power generation facilities are collecting “lots of data” into dedicated historical process databases (such as OSI PI). However, in most cases, only simple charts and “after-the-fact” ad-hoc analyses are performed on a small subset of those data; most information is simply not used.

For example, for coal fired power plants, our solutions can help you identify optimum settings for stoichiometric ratio, primary/tertiary air flows, secondary air biases, distribution of OFA (overfired air), burner tilts and yaw positions, and other controllable parameters to reduce NOx, CO, and LOI, without requiring any re-engineering of existing hardware.

### What is Data Mining? Why Data Mining?

Data Mining is the term used to describe the application of various machine learning and/or pattern recognition algorithms and techniques, to identify complex relationships among observed variables. These techniques can reveal invaluable insights when the data contain meaningful information which is “hidden” deep inside your data set, and cannot be identified with simple methods. Advanced data mining can reveal those insights by processing many variables and complex interrelations between them, all at the same time.

Unlike CFD, data mining allows you to model the “real world” from “real data,” describing your specific plant. Using this approach, you can:

• Identify, from among hundreds or even thousands of input parameters, those that are critical for low-emissions, efficient operations.
• Determine the ranges for those parameters, and the combinations of parameter ranges, that will result in robust and stable low-emissions operations without costly excursions (high-emissions events, unscheduled maintenance, and expensive generation roll-backs).

These results can be implemented using your existing closed- or open-loop control system to achieve sustained improvements in power plant performance, or you can use StatSoft MultiStream to create a state-of-the-art advanced process monitoring system to achieve permanent improvements.

### How is this Different from “Neural Nets” for Closed Loop Control?

One frequently asked question is: How do these solutions differ from neural networks based computer programs that can control critical power plant operations in a closed loop system (an approach used at some plants, often with less than expected success)?

The answer is that, because those systems are based on relatively simple, traditional neural networks technology which typically can only simultaneously process relatively few parameters, they are not capable of identifying the important parameters from among hundreds of possible candidates, and they will not identify specific combinations of parameter ranges (“sweet spots”) that make overall power plant operations more robust.

The cutting-edge technologies developed by StatSoft Power Solutions will not simply implement a cookie-cutter approach to use a few parameters common to all power plants to achieve some (usually only very modest) overall process performance improvements. Instead, our approach allows you to take a fresh look at all your data and operations, to optimize them for best performance. This will allow you to focus your process monitoring efforts, operator training, or automation initiatives only on those parameters that actually drive boiler efficiency, emissions, and so on at your plant and for your equipment.

What we are offering is not simply another neural net for closed loop control; instead, it provides flexible tools based on cutting-edge data processing technologies to optimize all systems, and also provides smart monitoring and advisory options capable of predicting problems, such as emissions related to combustion optimization or maintenance issues.

Contact StatSoft Southern Africa (lorraine@statsoft.co.za) for more information about our services, software solutions, and recent success stories.

## Featured Article: Full Rexer Report Shows StatSoft STATISTICA’s a Big Winner Among Users!

Rexer Analytics has just released the full summary report of its 5th Annual Data Miner Survey, and we are excited to share that StatSoft’s STATISTICA blew away the competition.

Not only was STATISTICA the primary data mining tool chosen most often by users, but STATISTICA also received the highest user satisfaction ratings in 16 out of 20 categories (see www.rexeranalytics.com).

## General Linear Models (GLM)

This topic describes the use of the general linear model in a wide variety of statistical analyses. If you are unfamiliar with the basic methods of ANOVA and regression in linear models, it may be useful to first review the basic information on these topics in Elementary Concepts. A detailed discussion of univariate and multivariate ANOVA techniques can also be found in the ANOVA/MANOVA topic.

## Basic Ideas: The General Linear Model

The following topics summarize the historical, mathematical, and computational foundations for the general linear model. For a basic introduction to ANOVA (MANOVA, ANCOVA) techniques, refer to ANOVA/MANOVA; for an introduction to multiple regression, see Multiple Regression; for an introduction to the design and analysis of experiments in applied (industrial) settings, see Experimental Design.

### Historical Background

The roots of the general linear model surely go back to the origins of mathematical thought, but it is the emergence of the theory of algebraic invariants in the 1800’s that made the general linear model, as we know it today, possible. The theory of algebraic invariants developed from the groundbreaking work of 19th century mathematicians such as Gauss, Boole, Cayley, and Sylvester. The theory seeks to identify those quantities in systems of equations which remain unchanged under linear transformations of the variables in the system. Stated more imaginatively (but in a way in which the originators of the theory would not consider an overstatement), the theory of algebraic invariants searches for the eternal and unchanging amongst the chaos of the transitory and the illusory. That is no small goal for any theory, mathematical or otherwise.

The wonder of it all is that the theory of algebraic invariants was successful far beyond the hopes of its originators. Eigenvalues, eigenvectors, determinants, and matrix decomposition methods all derive from the theory of algebraic invariants. The contributions of the theory of algebraic invariants to the development of statistical theory and methods are numerous, but a simple example familiar to even the most casual student of statistics is illustrative. The correlation between two variables is unchanged by linear transformations of either or both variables. We probably take this property of correlation coefficients for granted, but what would data analysis be like if we did not have statistics that are invariant to the scaling of the variables involved? Some thought on this question should convince you that without the theory of algebraic invariants, the development of useful statistical techniques would be nigh impossible.

The development of the linear regression model in the late 19th century, and the development of correlational methods shortly thereafter, are clearly direct outgrowths of the theory of algebraic invariants. Regression and correlational methods, in turn, serve as the basis for the general linear model. Indeed, the general linear model can be seen as an extension of linear multiple regression for a single dependent variable. Understanding the multiple regression model is fundamental to understanding the general linear model, so we will look at the purpose of multiple regression, the computational algorithms used to solve regression problems, and how the regression model is extended in the case of the general linear model. A basic introduction to multiple regression methods and the analytic problems to which they are applied is provided in the Multiple Regression topic.

### The Purpose of Multiple Regression

The general linear model can be seen as an extension of linear multiple regression for a single dependent variable, and understanding the multiple regression model is fundamental to understanding the general linear model. The general purpose of multiple regression (the term was first used by Pearson, 1908) is to quantify the relationship between several independent or predictor variables and a dependent or criterion variable. For a detailed introduction to multiple regression, also refer to the Multiple Regression section. For example, a real estate agent might record for each listing the size of the house (in square feet), the number of bedrooms, the average income in the respective neighborhood according to census data, and a subjective rating of appeal of the house. Once this information has been compiled for various houses it would be interesting to see whether and how these measures relate to the price for which a house is sold. For example, we might learn that the number of bedrooms is a better predictor of the price for which a house sells in a particular neighborhood than how “pretty” the house is (subjective rating). We may also detect “outliers,” for example, houses that should really sell for more, given their location and characteristics.

Personnel professionals customarily use multiple regression procedures to determine equitable compensation. We can determine a number of factors or dimensions such as “amount of responsibility” (Resp) or “number of people to supervise” (No_Super) that we believe to contribute to the value of a job. The personnel analyst then usually conducts a salary survey among comparable companies in the market, recording the salaries and respective characteristics (i.e., values on dimensions) for different positions. This information can be used in a multiple regression analysis to build a regression equation of the form:

Salary = .5*Resp + .8*No_Super

Once this so-called regression equation has been determined, the analyst can now easily construct a graph of the expected (predicted) salaries and the actual salaries of job incumbents in his or her company. Thus, the analyst is able to determine which position is underpaid (below the regression line) or overpaid (above the regression line), or paid equitably.
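To illustrate how such an equation might be used, here is a small Python sketch that compares actual salaries with the values predicted by the equation above. The position names, salary figures, and the tolerance used for "paid equitably" are hypothetical, and the units are whatever the regression was fitted in.

```python
def predicted_salary(resp, no_super):
    """Regression equation from the text: Salary = .5*Resp + .8*No_Super."""
    return 0.5 * resp + 0.8 * no_super


def pay_status(actual, resp, no_super, tolerance=1e-9):
    """Classify a position by the sign of its residual (actual - predicted)."""
    residual = actual - predicted_salary(resp, no_super)
    if residual > tolerance:
        return "overpaid"    # above the regression line
    if residual < -tolerance:
        return "underpaid"   # below the regression line
    return "equitable"


# Hypothetical positions: (title, actual salary, Resp, No_Super)
positions = [
    ("analyst", 30.0, 40, 10),
    ("supervisor", 25.0, 40, 10),
    ("clerk", 28.0, 40, 10),
]
statuses = {
    title: pay_status(actual, resp, no_super)
    for title, actual, resp, no_super in positions
}
```

In practice an analyst would use a band around the regression line rather than an exact match, since few positions fall precisely on the predicted value.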

In the social and natural sciences multiple regression procedures are very widely used in research. In general, multiple regression allows the researcher to ask (and hopefully answer) the general question “what is the best predictor of …”. For example, educational researchers might want to learn what are the best predictors of success in high-school. Psychologists may want to determine which personality variable best predicts social adjustment. Sociologists may want to find out which of the multiple social indicators best predict whether or not a new immigrant group will adapt and be absorbed into society.

### Computations for Solving the Multiple Regression Equation

A one-dimensional surface in a two-dimensional or two-variable space is a line defined by the equation Y = b0 + b1X. According to this equation, the Y variable can be expressed in terms of or as a function of a constant (b0) and a slope (b1) times the X variable. The constant is also referred to as the intercept, and the slope as the regression coefficient. For example, GPA may best be predicted as 1+.02*IQ. Thus, knowing that a student has an IQ of 130 would lead us to predict that her GPA would be 3.6 (since, 1+.02*130=3.6). In the multiple regression case, when there are multiple predictor variables, the regression surface usually cannot be visualized in a two dimensional space, but the computations are a straightforward extension of the computations in the single predictor case. For example, if in addition to IQ we had additional predictors of achievement (e.g., Motivation, Self-discipline) we could construct a linear equation containing all those variables. In general then, multiple regression procedures will estimate a linear equation of the form:

Y = b0 + b1X1 + b2X2 + … + bkXk

where k is the number of predictors. Note that in this equation, the regression coefficients (or b1, …, bk coefficients) represent the independent contributions of each independent variable to the prediction of the dependent variable. Another way to express this fact is to say that, for example, variable X1 is correlated with the Y variable, after controlling for all other independent variables. This type of correlation is also referred to as a partial correlation (this term was first used by Yule, 1907).

Perhaps the following example will clarify this issue. We would probably find a significant negative correlation between hair length and height in the population (i.e., short people have longer hair). At first this may seem odd; however, if we were to add the variable Gender into the multiple regression equation, this correlation would probably disappear. This is because women, on average, have longer hair than men; they are also shorter on average than men. Thus, after we remove this gender difference by entering Gender into the equation, the relationship between hair length and height disappears because hair length does not make any unique contribution to the prediction of height, above and beyond what it shares in the prediction with variable Gender. Put another way, after controlling for the variable Gender, the partial correlation between hair length and height is zero.
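The hair length example can be reproduced with a tiny, deliberately constructed data set. The sketch below uses the standard first-order partial correlation formula; the numbers are invented so that hair length and height are unrelated within each gender, and the function names are illustrative.

```python
import math


def pearson(x, y):
    """Ordinary Pearson correlation between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def partial_correlation(x, y, z):
    """Correlation of x and y after controlling for a single variable z."""
    r_xy, r_xz, r_yz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


# Invented data: four women followed by four men.
height = [160, 160, 170, 170, 175, 175, 185, 185]  # cm
hair = [30, 40, 30, 40, 5, 15, 5, 15]              # cm
female = [1, 1, 1, 1, 0, 0, 0, 0]                  # indicator for Gender

raw = pearson(hair, height)                          # strongly negative
partial = partial_correlation(hair, height, female)  # essentially zero
```

The raw correlation is strongly negative only because gender shifts both variables at once. Once gender is controlled for, nothing is left for hair length to explain.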

The regression surface (a line in simple regression, a plane or higher-dimensional surface in multiple regression) expresses the best prediction of the dependent variable (Y), given the independent variables (X’s). However, nature is rarely (if ever) perfectly predictable, and usually there is substantial variation of the observed points from the fitted regression surface. The deviation of a particular point from the nearest corresponding point on the predicted regression surface (its predicted value) is called the residual value. Since the goal of linear regression procedures is to fit a surface, which is a linear function of the X variables, as closely as possible to the observed Y variable, the residual values for the observed points can be used to devise a criterion for the “best fit.” Specifically, in regression problems the surface is computed for which the sum of the squared deviations of the observed points from that surface is minimized. Thus, this general procedure is sometimes also referred to as least squares estimation (see also the description of weighted least squares estimation).

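The actual computations involved in solving regression problems can be expressed compactly and conveniently using matrix notation. Suppose that there are n observed values of Y and n associated observed values for each of k different X variables. Then Yi, Xik, and ei can represent the ith observation of the Y variable, the ith observation of each of the X variables, and the ith unknown residual value, respectively. Collecting these terms into matrices we have

$$Y = \begin{bmatrix} Y_1 \\ \vdots \\ Y_n \end{bmatrix}, \qquad X = \begin{bmatrix} 1 & X_{11} & \cdots & X_{1k} \\ \vdots & \vdots & & \vdots \\ 1 & X_{n1} & \cdots & X_{nk} \end{bmatrix}, \qquad e = \begin{bmatrix} e_1 \\ \vdots \\ e_n \end{bmatrix}$$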

The multiple regression model in matrix notation then can be expressed as

Y = Xb + e

where b is a column vector of 1 (for the intercept) + k unknown regression coefficients. Recall that the goal of multiple regression is to minimize the sum of the squared residuals. Regression coefficients that satisfy this criterion are found by solving the set of normal equations

X’Xb = X’Y

When the X variables are linearly independent (i.e., they are nonredundant, yielding an X’X matrix which is of full rank) there is a unique solution to the normal equations. Premultiplying both sides of the matrix formula for the normal equations by the inverse of X’X gives

$$(X'X)^{-1}X'Xb = (X'X)^{-1}X'Y$$

or

$$b = (X'X)^{-1}X'Y$$
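The solution of the normal equations can be sketched in plain Python using only transposition, multiplication, and inversion. This is an illustration under assumptions rather than a production routine: the helper names and data are hypothetical, the inverse is computed by Gauss-Jordan elimination, and real statistical software typically uses numerically more stable decompositions.

```python
def transpose(m):
    return [list(col) for col in zip(*m)]


def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def invert(m):
    """Gauss-Jordan inverse with partial pivoting; fails if m is not full rank."""
    n = len(m)
    aug = [list(map(float, row)) + [float(i == j) for j in range(n)]
           for i, row in enumerate(m)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular (X'X is not of full rank)")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def least_squares(x_rows, y):
    """Return [b0, b1, ..., bk] from b = (X'X)^-1 X'Y."""
    X = [[1.0] + list(row) for row in x_rows]  # prepend the intercept column
    Y = [[v] for v in y]
    Xt = transpose(X)
    b = matmul(matmul(invert(matmul(Xt, X)), Xt), Y)
    return [row[0] for row in b]


# Hypothetical data generated exactly from Y = 1 + 2*X1 + 3*X2.
x_rows = [(1, 2), (2, 1), (3, 4), (4, 3), (5, 5)]
y = [9, 8, 19, 18, 26]
coefficients = least_squares(x_rows, y)
```

Because the example data contain no noise, the recovered coefficients match the generating equation up to floating-point rounding. With real data the residuals would be nonzero and the coefficients would be the least squares estimates.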

This last result is very satisfying in view of its simplicity and its generality. With regard to its simplicity, it expresses the solution for the regression equation in terms of just two matrices (X and Y) and three basic matrix operations: (1) matrix transposition, which involves interchanging the elements in the rows and columns of a matrix, (2) matrix multiplication, which involves finding the sum of the products of the elements for each row and column combination of two conformable (i.e., multipliable) matrices, and (3) matrix inversion, which involves finding the matrix equivalent of a numeric reciprocal, that is, the matrix that satisfies

$$A^{-1}AA = A$$

for a matrix A.

It took literally centuries for the ablest mathematicians and statisticians to find a satisfactory method for solving the linear least squares regression problem. But their efforts have paid off, for it is hard to imagine a simpler solution.

With regard to the generality of the multiple regression model, its only notable limitations are that (1) it can be used to analyze only a single dependent variable, and (2) it cannot provide a solution for the regression coefficients when the X variables are not linearly independent and the inverse of X’X therefore does not exist. These restrictions, however, can be overcome, and in doing so the multiple regression model is transformed into the general linear model.

### Extension of Multiple Regression to the General Linear Model

One way in which the general linear model differs from the multiple regression model is in terms of the number of dependent variables that can be analyzed. The Y vector of n observations of a single Y variable can be replaced by a Y matrix of n observations of m different Y variables. Similarly, the b vector of regression coefficients for a single Y variable can be replaced by a b matrix of regression coefficients, with one vector of b coefficients for each of the m dependent variables. These substitutions yield what is sometimes called the multivariate regression model, but it should be emphasized that the matrix formulations of the multiple and multivariate regression models are identical, except for the number of columns in the Y and b matrices. The method for solving for the b coefficients is also identical, that is, m different sets of regression coefficients are separately found for the m different dependent variables in the multivariate regression model.

The general linear model goes a step beyond the multivariate regression model by allowing for linear transformations or linear combinations of multiple dependent variables. This extension gives the general linear model important advantages over the multiple and the so-called multivariate regression models, both of which are inherently univariate (single dependent variable) methods. One advantage is that multivariate tests of significance can be employed when responses on multiple dependent variables are correlated. Separate univariate tests of significance for correlated dependent variables are not independent and may not be appropriate. Multivariate tests of significance of independent linear combinations of multiple dependent variables also can give insight into which dimensions of the response variables are, and are not, related to the predictor variables. Another advantage is the ability to analyze effects of repeated measure factors. Repeated measure designs, or within-subject designs, have traditionally been analyzed using ANOVA techniques. Linear combinations of responses reflecting a repeated measure effect (for example, the difference of responses on a measure under differing conditions) can be constructed and tested for significance using either the univariate or multivariate approach to analyzing repeated measures in the general linear model.

A second important way in which the general linear model differs from the multiple regression model is in its ability to provide a solution for the normal equations when the X variables are not linearly independent and the inverse of X’X does not exist. Redundancy of the X variables may be incidental (e.g., two predictor variables might happen to be perfectly correlated in a small data set), accidental (e.g., two copies of the same variable might unintentionally be used in an analysis) or designed (e.g., indicator variables with exactly opposite values might be used in the analysis, as when both Male and Female predictor variables are used in representing Gender). Finding the regular inverse of a non-full-rank matrix is reminiscent of the problem of finding the reciprocal of 0 in ordinary arithmetic. No such inverse or reciprocal exists because division by 0 is not permitted. This problem is solved in the general linear model by using a generalized inverse of the X’X matrix in solving the normal equations. A generalized inverse is any matrix that satisfies

$$AA^{-}A = A$$

for a matrix A, where $A^{-}$ denotes the generalized inverse. A generalized inverse is unique and is the same as the regular inverse only if the matrix A is full rank. A generalized inverse for a non-full-rank matrix can be computed by the simple expedient of zeroing the elements in redundant rows and columns of the matrix. Suppose that an X’X matrix with r non-redundant columns is partitioned as

$$X'X = \begin{bmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{bmatrix}$$

where $A_{11}$ is an r by r matrix of rank r. Then the regular inverse of $A_{11}$ exists and a generalized inverse of X’X is

$$(X'X)^{-} = \begin{bmatrix} A_{11}^{-1} & 0_{12} \\ 0_{21} & 0_{22} \end{bmatrix}$$

where each 0 (null) matrix is a matrix of 0’s (zeroes) and has the same dimensions as the corresponding A matrix.
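The zeroing construction can be checked directly. The sketch below builds a generalized inverse from the leading r by r block of a small X’X matrix and confirms the defining property that multiplying the matrix by its generalized inverse and by itself again returns the original matrix. The matrix comes from a hypothetical design with an intercept, a Male indicator, and a Female indicator (2 males, 3 females). Exact fractions are used so that the check holds without rounding error.

```python
from fractions import Fraction


def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def invert(m):
    """Exact Gauss-Jordan inverse of a full-rank square matrix."""
    n = len(m)
    aug = [[Fraction(v) for v in row] + [Fraction(int(i == j)) for j in range(n)]
           for i, row in enumerate(m)]
    for col in range(n):
        pivot = next((r for r in range(col, n) if aug[r][col] != 0), None)
        if pivot is None:
            raise ValueError("block is not of full rank")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def generalized_inverse(a, r):
    """Invert the leading r x r block (A11) and zero everything else."""
    size = len(a)
    a11_inverse = invert([row[:r] for row in a[:r]])
    g = [[Fraction(0)] * size for _ in range(size)]
    for i in range(r):
        for j in range(r):
            g[i][j] = a11_inverse[i][j]
    return g


# X'X for columns (intercept, Male, Female); the third column is redundant
# because intercept = Male + Female, so the rank is r = 2.
xtx = [[5, 2, 3],
       [2, 2, 0],
       [3, 0, 3]]
g = generalized_inverse(xtx, 2)
round_trip = matmul(matmul(xtx, g), xtx)  # should reproduce xtx exactly
```

This assumes the non-redundant columns come first, as in the partition above. A different ordering would need the columns to be permuted before the block is inverted.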

In practice, however, a particular generalized inverse of X’X for finding a solution to the normal equations is usually computed using the sweep operator (Dempster, 1960). This generalized inverse, called a g2 inverse, has two important properties. One is that zeroing of the elements in redundant rows is unnecessary. Another is that partitioning or reordering of the columns of X’X is unnecessary, so that the matrix can be inverted “in place.”

There are infinitely many generalized inverses of a non-full-rank X’X matrix, and thus, infinitely many solutions to the normal equations. This can make it difficult to understand the nature of the relationships of the predictor variables to responses on the dependent variables, because the regression coefficients can change depending on the particular generalized inverse chosen for solving the normal equations. It is not cause for dismay, however, because of the invariance properties of many results obtained using the general linear model.

A simple example may be useful for illustrating one of the most important invariance properties of the use of generalized inverses in the general linear model. If both Male and Female predictor variables with exactly opposite values are used in an analysis to represent Gender, it is essentially arbitrary as to which predictor variable is considered to be redundant (e.g., Male can be considered to be redundant with Female, or vice versa). No matter which predictor variable is considered to be redundant, no matter which corresponding generalized inverse is used in solving the normal equations, and no matter which resulting regression equation is used for computing predicted values on the dependent variables, the predicted values and the corresponding residuals for males and females will be unchanged. In using the general linear model, we must keep in mind that finding a particular arbitrary solution to the normal equations is primarily a means to the end of accounting for responses on the dependent variables, and not necessarily an end in itself.
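A small numerical sketch can show this invariance. The code below fits the same overparameterized Gender design twice, once treating Female as the redundant column and once treating Male as redundant, by zeroing the corresponding row and column in the generalized inverse. The data, helper names, and the `keep` argument are illustrative assumptions, and exact fractions are used so the comparison is exact.

```python
from fractions import Fraction


def transpose(m):
    return [list(col) for col in zip(*m)]


def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def invert(m):
    """Exact Gauss-Jordan inverse of a full-rank square matrix."""
    n = len(m)
    aug = [[Fraction(v) for v in row] + [Fraction(int(i == j)) for j in range(n)]
           for i, row in enumerate(m)]
    for col in range(n):
        pivot = next((r for r in range(col, n) if aug[r][col] != 0), None)
        if pivot is None:
            raise ValueError("kept columns are not linearly independent")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def generalized_inverse(a, keep):
    """Invert the rows/columns listed in keep; redundant ones stay zero."""
    sub_inverse = invert([[a[i][j] for j in keep] for i in keep])
    g = [[Fraction(0)] * len(a) for _ in a]
    for si, i in enumerate(keep):
        for sj, j in enumerate(keep):
            g[i][j] = sub_inverse[si][sj]
    return g


def fit(X, y, keep):
    """Return (coefficients, predicted values) for one choice of redundancy."""
    Xt = transpose(X)
    b = matmul(matmul(generalized_inverse(matmul(Xt, X), keep), Xt), [[v] for v in y])
    return [row[0] for row in b], [row[0] for row in matmul(X, b)]


# Columns: intercept, Male, Female. Two males, then three females.
X = [[1, 1, 0], [1, 1, 0], [1, 0, 1], [1, 0, 1], [1, 0, 1]]
y = [10, 12, 20, 22, 24]

b_female_redundant, predicted_a = fit(X, y, keep=[0, 1])
b_male_redundant, predicted_b = fit(X, y, keep=[0, 2])
```

The two coefficient vectors differ (one uses the female mean as the intercept, the other the male mean), yet both sets of predicted values are the group means, so the residuals are identical as well.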

### Sigma-Restricted and Overparameterized Model

Unlike the multiple regression model, which is usually applied to cases where the X variables are continuous, the general linear model is frequently applied to analyze any ANOVA or MANOVA design with categorical predictor variables, any ANCOVA or MANCOVA design with both categorical and continuous predictor variables, as well as any multiple or multivariate regression design with continuous predictor variables. To illustrate, Gender is clearly a nominal level variable (anyone who attempts to rank order the sexes on any dimension does so at his or her own peril in today’s world). There are two basic methods by which Gender can be coded into one or more (non-offensive) predictor variables, and analyzed using the general linear model.

Sigma-restricted model (coding of categorical predictors). Using the first method, males and females can be assigned any two arbitrary, but distinct values on a single predictor variable. The values on the resulting predictor variable will represent a quantitative contrast between males and females. Typically, the values corresponding to group membership are chosen not arbitrarily but rather to facilitate interpretation of the regression coefficient associated with the predictor variable. In one widely used strategy, cases in the two groups are assigned values of 1 and -1 on the predictor variable, so that if the regression coefficient for the variable is positive, the group coded as 1 on the predictor variable will have a higher predicted value (i.e., a higher group mean) on the dependent variable, and if the regression coefficient is negative, the group coded as -1 on the predictor variable will have a higher predicted value on the dependent variable. An additional advantage is that since each group is coded with a value one unit from zero, this helps in interpreting the magnitude of differences in predicted values between groups, because regression coefficients reflect the units of change in the dependent variable for each unit change in the predictor variable. This coding strategy is aptly called the sigma-restricted parameterization, because the values used to represent group membership (1 and -1) sum to zero.

Note that the sigma-restricted parameterization of categorical predictor variables usually leads to X’X matrices which do not require a generalized inverse for solving the normal equations. Potentially redundant information, such as the characteristics of maleness and femaleness, is literally reduced to full-rank by creating quantitative contrast variables representing differences in characteristics.

Overparameterized model (coding of categorical predictors). The second basic method for recoding categorical predictors is the indicator variable approach. In this method a separate predictor variable is coded for each group identified by a categorical predictor variable. To illustrate, females might be assigned a value of 1 and males a value of 0 on a first predictor variable identifying membership in the female Gender group, and males would then be assigned a value of 1 and females a value of 0 on a second predictor variable identifying membership in the male Gender group. Note that this method of recoding categorical predictor variables will almost always lead to X’X matrices with redundant columns, and thus require a generalized inverse for solving the normal equations. As such, this method is often called the overparameterized model for representing categorical predictor variables, because it results in more columns in the X’X matrix than are necessary for determining the relationships of categorical predictor variables to responses on the dependent variables.

True to its description as general, the general linear model can be used to perform analyses with categorical predictor variables which are coded using either of the two basic methods that have been described.
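The two coding schemes can be compared side by side in a short sketch. The functions below build the design matrix columns for a two-level Gender factor under each scheme and compute the rank of the result; the function names, level labels, and sample cases are hypothetical.

```python
from fractions import Fraction


def sigma_restricted(groups, plus_level):
    """Intercept plus one contrast column coded 1 / -1."""
    return [[1, 1 if g == plus_level else -1] for g in groups]


def overparameterized(groups, levels):
    """Intercept plus one 0/1 indicator column per level."""
    return [[1] + [1 if g == level else 0 for level in levels] for g in groups]


def rank(matrix):
    """Rank by exact Gaussian elimination (same as the rank of X'X)."""
    rows = [[Fraction(v) for v in row] for row in matrix]
    found = 0
    for col in range(len(rows[0])):
        pivot = next((r for r in range(found, len(rows)) if rows[r][col] != 0), None)
        if pivot is None:
            continue  # this column is redundant given the earlier ones
        rows[found], rows[pivot] = rows[pivot], rows[found]
        for r in range(len(rows)):
            if r != found and rows[r][col] != 0:
                factor = rows[r][col] / rows[found][col]
                rows[r] = [a - factor * b for a, b in zip(rows[r], rows[found])]
        found += 1
    return found


gender = ["F", "M", "M", "F", "F"]
x_sigma = sigma_restricted(gender, "F")        # 2 columns
x_over = overparameterized(gender, ["F", "M"])  # 3 columns
```

The sigma-restricted matrix has as many columns as its rank, so X’X can be inverted in the ordinary way. The overparameterized matrix has three columns but rank two, because the intercept equals the sum of the two indicators, which is why a generalized inverse is needed.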

### Summary of Computations

To conclude this discussion of the ways in which the general linear model extends and generalizes regression methods, the general linear model can be expressed as

YM = Xb + e

Here Y, X, b, and e are as described for the multivariate regression model and M is an m x s matrix of coefficients defining s linear transformations of the dependent variables. The normal equations are

X’Xb = X’YM

and a solution for the normal equations is given by

$$b = (X'X)^{-}X'YM$$

Here the inverse of X’X is a generalized inverse if X’X contains redundant columns.

Add a provision for analyzing linear combinations of multiple dependent variables, add a method for dealing with redundant predictor variables and recoded categorical predictor variables, and the major limitations of multiple regression are overcome by the general linear model.
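As a hedged illustration of the role of M, the sketch below analyzes the difference between two repeated measures. The data are invented, M is a 2 x 1 matrix that forms the post minus pre difference, and X uses an intercept with a sigma-restricted group code, so X’X is of full rank and the regular inverse serves as the generalized inverse.

```python
from fractions import Fraction


def transpose(m):
    return [list(col) for col in zip(*m)]


def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def invert(m):
    """Exact Gauss-Jordan inverse of a full-rank square matrix."""
    n = len(m)
    aug = [[Fraction(v) for v in row] + [Fraction(int(i == j)) for j in range(n)]
           for i, row in enumerate(m)]
    for col in range(n):
        pivot = next((r for r in range(col, n) if aug[r][col] != 0), None)
        if pivot is None:
            raise ValueError("X'X is not of full rank")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


# Hypothetical data: n = 4 cases, m = 2 dependent variables (pre, post).
Y = [[10, 14], [12, 18], [11, 12], [9, 12]]
M = [[-1], [1]]                           # s = 1 transformation: post - pre
X = [[1, 1], [1, 1], [1, -1], [1, -1]]    # intercept, group coded 1 / -1

YM = matmul(Y, M)                         # transformed responses
Xt = transpose(X)
b = matmul(matmul(invert(matmul(Xt, X)), Xt), YM)  # b = (X'X)^-1 X'YM
```

Here the intercept coefficient is the average change across the two groups and the group coefficient is half the difference between the group changes, which is the kind of repeated measure effect the transformation is meant to isolate.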

## Types of Analyses

A wide variety of types of designs can be analyzed using the general linear model. In fact, the flexibility of the general linear model allows it to handle so many different types of designs that it is difficult to develop simple typologies of the ways in which these designs might differ. Some general ways in which designs might differ can be suggested, but keep in mind that any particular design can be a “hybrid” in the sense that it could have combinations of features of a number of different types of designs.

In the following discussion, references will be made to the design matrix X, as well as sigma-restricted and overparameterized model coding. For an explanation of this terminology, refer to the section entitled Basic Ideas: The General Linear Model, or, for a brief summary, to the Summary of Computations section.

A basic discussion of univariate and multivariate ANOVA techniques can also be found in the ANOVA/MANOVA topic; a discussion of multiple regression methods is also provided in the Multiple Regression topic.

### Between-Subject Designs

Overview. The levels or values of the predictor variables in an analysis describe the differences between the n subjects or the n valid cases that are analyzed. Thus, when we speak of the between subject design (or simply the between design) for an analysis, we are referring to the nature, number, and arrangement of the predictor variables.

Concerning the nature or type of predictor variables, between designs which contain only categorical predictor variables can be called ANOVA (analysis of variance) designs, between designs which contain only continuous predictor variables can be called regression designs, and between designs which contain both categorical and continuous predictor variables can be called ANCOVA (analysis of covariance) designs. Further, continuous predictors are always considered to have fixed values, but the levels of categorical predictors can be considered to be fixed or to vary randomly. Designs which contain random categorical factors are called mixed-model designs (see the Variance Components and Mixed Model ANOVA/ANCOVA section).

Between designs may involve only a single predictor variable and therefore be described as simple (e.g., simple regression) or may employ numerous predictor variables (e.g., multiple regression).

Concerning the arrangement of predictor variables, some between designs employ only “main effect” or first-order terms for predictors, that is, the values for different predictor variables are independent and raised only to the first power. Other between designs may employ higher-order terms for predictors by raising the values for the original predictor variables to a power greater than 1 (e.g., in polynomial regression designs), or by forming products of different predictor variables (i.e., interaction terms). A common arrangement for ANOVA designs is the full-factorial design, in which every combination of levels for each of the categorical predictor variables is represented in the design. Designs with some but not all combinations of levels for each of the categorical predictor variables are aptly called fractional factorial designs. Designs with a hierarchy of combinations of levels for the different categorical predictor variables are called nested designs.

These basic distinctions about the nature, number, and arrangement of predictor variables can be used in describing a variety of different types of between designs. Some of the more common between designs can now be described.

One-Way ANOVA. A design with a single categorical predictor variable is called a one-way ANOVA design. For example, a study of 4 different fertilizers used on different individual plants could be analyzed via one-way ANOVA, with four levels for the factor Fertilizer.

In general, consider a single categorical predictor variable A with 1 case in each of its 3 categories. Using the sigma-restricted coding of A into 2 quantitative contrast variables, the matrix X defining the between design is

That is, cases in groups A1, A2, and A3 are all assigned values of 1 on X0 (the intercept), the case in group A1 is assigned a value of 1 on X1 and a value 0 on X2, the case in group A2 is assigned a value of 0 on X1 and a value 1 on X2, and the case in group A3 is assigned a value of -1 on X1 and a value -1 on X2. Of course, any additional cases in any of the 3 groups would be coded similarly. If there were 1 case in group A1, 2 cases in group A2, and 1 case in group A3, the X matrix would be

where the first subscript for A gives the replicate number for the cases in each group. For brevity, replicates usually are not shown when describing ANOVA design matrices.

Note that in one-way designs with an equal number of cases in each group, sigma-restricted coding yields X1 … Xk variables all of which have means of 0.

Using the overparameterized model to represent A, the X matrix defining the between design is simply

These simple examples show that the X matrix actually serves two purposes. It specifies (1) the coding for the levels of the original predictor variables on the X variables used in the analysis as well as (2) the nature, number, and arrangement of the X variables, that is, the between design.
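The one-way design matrices described above can be sketched in a few lines of NumPy. The helper names `sigma_restricted` and `overparameterized` are hypothetical, and the overparameterized layout (an intercept plus one 0/1 indicator column per level) is an assumption about the coding rather than a reproduction of the original figures.

```python
import numpy as np

def sigma_restricted(levels, k):
    """Intercept plus k-1 contrast columns; the last level is coded -1 on all."""
    rows = []
    for lev in levels:
        if lev == k:
            contrasts = [-1] * (k - 1)
        else:
            contrasts = [1 if lev == j else 0 for j in range(1, k)]
        rows.append([1] + contrasts)
    return np.array(rows)

def overparameterized(levels, k):
    """Intercept plus one 0/1 indicator column per level."""
    return np.array([[1] + [1 if lev == j else 0 for j in range(1, k + 1)]
                     for lev in levels])

X_sigma = sigma_restricted([1, 2, 3], k=3)      # 1 case per group
X_unbal = sigma_restricted([1, 2, 2, 3], k=3)   # 2 cases in group A2
X_over = overparameterized([1, 2, 3], k=3)

print(X_sigma)
print(X_unbal)
print(X_over)
```

With one case per group, the contrast columns of `X_sigma` average to 0, which matches the remark about equal group sizes.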

Main Effect ANOVA. Main effect ANOVA designs contain separate one-way ANOVA designs for 2 or more categorical predictors. A good example of main effect ANOVA would be the typical analysis performed on screening designs as described in the context of the Experimental Design section.

Consider 2 categorical predictor variables A and B each with 2 categories. Using the sigma-restricted coding, the X matrix defining the between design is

Note that if there are equal numbers of cases in each group, the sum of the cross-products of values for the X1 and X2 columns is 0, for example, with 1 case in each group (1*1)+(1*-1)+(-1*1)+(-1*-1)=0. Using the overparameterized model, the matrix X defining the between design is

Comparing the two types of coding, it can be seen that the overparameterized coding takes almost twice as many values as the sigma-restricted coding to convey the same information.
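To make the comparison concrete, here is a minimal sketch of both codings for the 2 x 2 main effect design with 1 case per cell. The row order (A1B1, A1B2, A2B1, A2B2) and the column order of the overparameterized matrix are assumptions.

```python
import numpy as np
from itertools import product

cells = list(product([1, 2], [1, 2]))   # (A level, B level) for each case

# Sigma-restricted: X0 (intercept), X1 (A contrast), X2 (B contrast)
X_sigma = np.array([[1, 1 if a == 1 else -1, 1 if b == 1 else -1]
                    for a, b in cells])

# Overparameterized: intercept, A1, A2, B1, B2 indicators
X_over = np.array([[1, int(a == 1), int(a == 2), int(b == 1), int(b == 2)]
                   for a, b in cells])

# Sum of cross-products of the X1 and X2 columns
cross = int(X_sigma[:, 1] @ X_sigma[:, 2])

print(cross)                        # 0 for the balanced design
print(X_sigma.size, X_over.size)    # 12 values versus 20 values
```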

Factorial ANOVA. Factorial ANOVA designs contain X variables representing combinations of the levels of 2 or more categorical predictors (e.g., a study of boys and girls in four age groups, resulting in a 2 (Gender) x 4 (Age Group) design). In particular, full-factorial designs represent all possible combinations of the levels of the categorical predictors. A full-factorial design with 2 categorical predictor variables A and B, each with 2 levels, would be called a 2 x 2 full-factorial design. Using the sigma-restricted coding, the X matrix for this design would be

Several features of this X matrix deserve comment. Note that the X1 and X2 columns represent main effect contrasts for one variable (i.e., A and B, respectively), collapsing across the levels of the other variable. The X3 column instead represents a contrast between different combinations of the levels of A and B. Note also that the values for X3 are products of the corresponding values for X1 and X2. Product variables such as X3 represent the multiplicative or interaction effects of their factors, so X3 would be said to represent the 2-way interaction of A and B. The relationship of such product variables to the dependent variables indicates the interactive influences of the factors on responses above and beyond their independent (i.e., main effect) influences on responses. Thus, factorial designs provide more information about the relationships between categorical predictor variables and responses on the dependent variables than is provided by corresponding one-way or main effect designs.
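The product structure of the interaction column is easy to see in code. This is a sketch under the assumption that the rows are ordered A1B1, A1B2, A2B1, A2B2 with 1 case per cell.

```python
import numpy as np

A = np.array([1, 1, -1, -1])   # X1: A1 versus A2
B = np.array([1, -1, 1, -1])   # X2: B1 versus B2

# X3 is the elementwise product of X1 and X2 (the A by B interaction)
X = np.column_stack([np.ones(4, dtype=int), A, B, A * B])
print(X)
```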

When many factors are being investigated, however, full-factorial designs sometimes require more data than reasonably can be collected to represent all possible combinations of levels of the factors, and high-order interactions between many factors can become difficult to interpret. With many factors, a useful alternative to the full-factorial design is the fractional factorial design. As an example, consider a 2 x 2 x 2 fractional factorial design to degree 2 with 3 categorical predictor variables each with 2 levels. The design would include the main effects for each variable, and all 2-way interactions between the three variables, but would not include the 3-way interaction between all three variables. Using the overparameterized model, the X matrix for this design is

The 2-way interactions are the highest degree effects included in the design. These types of designs are discussed in detail in the 2**(k-p) Fractional Factorial Designs section of the Experimental Design topic.

Nested ANOVA Designs. Nested designs are similar to fractional factorial designs in that all possible combinations of the levels of the categorical predictor variables are not represented in the design. In nested designs, however, the omitted effects are lower-order effects. Nested effects are effects in which the nested variables never appear as main effects. Suppose that for 2 variables A and B with 3 and 2 levels, respectively, the design includes the main effect for A and the effect of B nested within the levels of A. The X matrix for this design using the overparameterized model is

Note that if the sigma-restricted coding were used, there would be only 2 columns in the X matrix for the B nested within A effect instead of the 6 columns in the X matrix for this effect when the overparameterized model coding is used (i.e., columns X4 through X9). The sigma-restricted coding method is overly restrictive for nested designs, so only the overparameterized model is used to represent nested designs.
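A possible construction of the overparameterized matrix for this nested design is sketched below, with 1 case per cell. The ordering of the rows and of the six nested columns is an assumption; only the column counts (1 intercept, 3 for A, 6 for B within A) come from the text.

```python
import numpy as np

# One case per (A, B) cell: A has 3 levels, B has 2 levels within each A.
cells = [(a, b) for a in (1, 2, 3) for b in (1, 2)]

rows = []
for a, b in cells:
    a_cols = [int(a == i) for i in (1, 2, 3)]      # X1..X3: main effect of A
    nested = [int((a, b) == c) for c in cells]     # X4..X9: B nested within A
    rows.append([1] + a_cols + nested)

X = np.array(rows)
print(X.shape)   # 6 cases by 10 columns
```

Because there is no separate set of columns for B alone, B appears only through its combinations with the levels of A.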

Balanced ANOVA. Most of the between designs discussed in this section can be analyzed much more efficiently when they are balanced, i.e., when all cells in the ANOVA design have equal n, when there are no missing cells in the design, and, if nesting is present, when the nesting is balanced so that equal numbers of levels of the factors that are nested appear in the levels of the factor(s) that they are nested in. In that case, the X’X matrix (where X stands for the design matrix) is a diagonal matrix, and many of the computations necessary to compute the ANOVA results (such as matrix inversion) are greatly simplified.
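The following sketch checks the diagonal property for a sigma-restricted 2 x 2 factorial design, first with 2 cases in every cell and then with one extra case added to a single cell. The helper names and the cell sizes are hypothetical choices for the illustration.

```python
import numpy as np

def factorial_2x2(cells):
    """Sigma-restricted 2 x 2 factorial rows: X0, X1 (A), X2 (B), X3 (A by B)."""
    code = {1: 1, 2: -1}
    return np.array([[1, code[a], code[b], code[a] * code[b]] for a, b in cells])

def is_diagonal(m):
    """True when every off-diagonal entry of m is zero."""
    return np.count_nonzero(m - np.diag(np.diag(m))) == 0

balanced = [(1, 1), (1, 2), (2, 1), (2, 2)] * 2   # n = 2 in every cell
unbalanced = balanced + [(1, 1)]                  # one extra case in A1B1

X_bal = factorial_2x2(balanced)
X_unbal = factorial_2x2(unbalanced)
XtX_bal = X_bal.T @ X_bal
XtX_unbal = X_unbal.T @ X_unbal

print(is_diagonal(XtX_bal), is_diagonal(XtX_unbal))
```

When X’X is diagonal, inverting it amounts to taking the reciprocal of each diagonal entry, which is the simplification the paragraph refers to.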

Simple Regression. Simple regression designs involve a single continuous predictor variable. If there were 3 cases with values on a predictor variable P of, say, 7, 4, and 9, and the design is for the first-order effect of P, the X matrix would be

and using P for X1 the regression equation would be

Y = b0 + b1P

If the simple regression design is for a higher-order effect of P, say the quadratic effect, the values in the X1 column of the design matrix would be raised to the 2nd power, that is, squared

and using $P^2$ for X1 the regression equation would be

$Y = b_0 + b_1P^2$

The sigma-restricted and overparameterized coding methods do not apply to simple regression designs and any other design containing only continuous predictors (since there are no categorical predictors to code). Regardless of which coding method is chosen, values on the continuous predictor variables are raised to the desired power and used as the values for the X variables. No recoding is performed. It is therefore sufficient, in describing regression designs, to simply describe the regression equation without explicitly describing the design matrix X.
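For the three example values of P given above (7, 4, and 9), the first-order and quadratic design matrices can be built directly. This is a minimal NumPy sketch; the variable names are illustrative.

```python
import numpy as np

P = np.array([7, 4, 9])

# First-order design: X0 (intercept) and X1 = P
X_linear = np.column_stack([np.ones_like(P), P])

# Quadratic design: X0 (intercept) and X1 = P squared
X_quadratic = np.column_stack([np.ones_like(P), P ** 2])

print(X_linear)
print(X_quadratic)
```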

Multiple Regression. Multiple regression designs are to continuous predictor variables as main effect ANOVA designs are to categorical predictor variables, that is, multiple regression designs contain the separate simple regression designs for 2 or more continuous predictor variables. The regression equation for a multiple regression design for the first-order effects of 3 continuous predictor variables P, Q, and R would be

Y = b0 + b1P + b2Q + b3R

Factorial Regression. Factorial regression designs are similar to factorial ANOVA designs, in which combinations of the levels of the factors are represented in the design. In factorial regression designs, however, there may be many more such possible combinations of distinct levels for the continuous predictor variables than there are cases in the data set. To simplify matters, full-factorial regression designs are defined as designs in which all possible products of the continuous predictor variables are represented in the design. For example, the full-factorial regression design for two continuous predictor variables P and Q would include the main effects (i.e., the first-order effects) of P and Q and their 2-way P by Q interaction effect, which is represented by the product of P and Q scores for each case. The regression equation would be

Y = b0 + b1P + b2Q + b3P*Q

Factorial regression designs can also be fractional, that is, higher-order effects can be omitted from the design. A fractional factorial design to degree 2 for 3 continuous predictor variables P, Q, and R would include the main effects and all 2-way interactions between the predictor variables

Y = b0 + b1P + b2Q + b3R + b4P*Q + b5P*R + b6Q*R

Polynomial Regression. Polynomial regression designs are designs which contain main effects and higher-order effects for the continuous predictor variables but do not include interaction effects between predictor variables. For example, the polynomial regression design to degree 2 for three continuous predictor variables P, Q, and R would include the main effects (i.e., the first-order effects) of P, Q, and R and their quadratic (i.e., second-order) effects, but not the 2-way interaction effects or the P by Q by R 3-way interaction effect.

$Y = b_0 + b_1P + b_2P^2 + b_3Q + b_4Q^2 + b_5R + b_6R^2$

Polynomial regression designs do not have to contain all effects up to the same degree for every predictor variable. For example, main, quadratic, and cubic effects could be included in the design for some predictor variables, and effects up to the fourth degree could be included in the design for other predictor variables.

Response Surface Regression. Quadratic response surface regression designs are a hybrid type of design with characteristics of both polynomial regression designs and fractional factorial regression designs. Quadratic response surface regression designs contain all the same effects as polynomial regression designs to degree 2 and additionally the 2-way interaction effects of the predictor variables. The regression equation for a quadratic response surface regression design for 3 continuous predictor variables P, Q, and R would be

$Y = b_0 + b_1P + b_2P^2 + b_3Q + b_4Q^2 + b_5R + b_6R^2 + b_7P*Q + b_8P*R + b_9Q*R$

These types of designs are commonly employed in applied research (e.g., in industrial experimentation), and a detailed discussion of these types of designs is also presented in the Experimental Design topic (see Central composite designs).
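Since regression designs need no recoding, the design matrix columns follow directly from the regression equation. The sketch below builds the columns for the quadratic response surface design in the same order as its coefficients b0 through b9. The function name and the predictor values are hypothetical.

```python
import numpy as np
from itertools import combinations

def response_surface(P, Q, R):
    """Columns: 1, P, P^2, Q, Q^2, R, R^2, P*Q, P*R, Q*R."""
    preds = [P, Q, R]
    cols = [np.ones_like(P)]
    for v in preds:
        cols += [v, v ** 2]                              # main and quadratic effects
    cols += [u * v for u, v in combinations(preds, 2)]   # 2-way interactions
    return np.column_stack(cols)

# Hypothetical predictor values for two cases
P = np.array([1.0, 2.0])
Q = np.array([3.0, 4.0])
R = np.array([5.0, 6.0])

X = response_surface(P, Q, R)
print(X[0])
```

Dropping the last three columns gives the polynomial design to degree 2, and dropping the three squared columns gives the fractional factorial regression design to degree 2.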

Mixture Surface Regression. Mixture surface regression designs are identical to factorial regression designs to degree 2 except for the omission of the intercept. Mixtures, as the name implies, add up to a constant value; the proportions of ingredients in any recipe for some material must add up to 100%. Thus, the proportion of one ingredient in a material is redundant with the remaining ingredients. Mixture surface regression designs deal with this redundancy by omitting the intercept from the design. The regression equation for a mixture surface regression design for 3 continuous predictor variables P, Q, and R would be

Y = b1P + b2Q + b3R + b4P*Q + b5P*R + b6Q*R

These types of designs are commonly employed in applied research (e.g., in industrial experimentation), and a detailed discussion of these types of designs is also presented in the Experimental Design topic (see Mixture designs and triangular surfaces).
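The redundancy that motivates dropping the intercept can be checked numerically. In this sketch the seven recipes are hypothetical; each row of proportions sums to 1, so an intercept column is an exact linear combination of P, Q, and R and adds nothing to the rank of the design matrix.

```python
import numpy as np

# Hypothetical recipes: proportions of three ingredients, each row sums to 1.
mix = np.array([[1.0, 0.0, 0.0],
                [0.0, 1.0, 0.0],
                [0.0, 0.0, 1.0],
                [0.5, 0.5, 0.0],
                [0.5, 0.0, 0.5],
                [0.0, 0.5, 0.5],
                [1/3, 1/3, 1/3]])
P, Q, R = mix.T

# Mixture surface design: no intercept column
X_mix = np.column_stack([P, Q, R, P * Q, P * R, Q * R])

# Same design with an intercept added
X_with_intercept = np.column_stack([np.ones(len(P)), X_mix])

rank_mix = np.linalg.matrix_rank(X_mix)
rank_int = np.linalg.matrix_rank(X_with_intercept)
print(rank_mix, rank_int)   # the extra column does not raise the rank
```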

Analysis of Covariance. In general, between designs which contain both categorical and continuous predictor variables can be called ANCOVA designs. Traditionally, however, ANCOVA designs have referred more specifically to designs in which the first-order effects of one or more continuous predictor variables are taken into account when assessing the effects of one or more categorical predictor variables. A basic introduction to analysis of covariance can also be found in the Analysis of covariance (ANCOVA) section of the ANOVA/MANOVA topic.

To illustrate, suppose a researcher wants to assess the influences of a categorical predictor variable A with 3 levels on some outcome, and that measurements on a continuous predictor variable P, known to covary with the outcome, are available. If the data for the analysis are

then the sigma-restricted X matrix for the design that includes the separate first-order effects of P and A would be

The b2 and b3 coefficients in the regression equation

Y = b0 + b1X1 + b2X2 + b3X3

represent the influences of group membership on the A categorical predictor variable, controlling for the influence of scores on the P continuous predictor variable. Similarly, the b1 coefficient represents the influence of scores on P controlling for the influences of group membership on A. This traditional ANCOVA analysis gives a more sensitive test of the influence of A to the extent that P reduces the prediction error, that is, the residuals for the outcome variable.

The X matrix for the same design using the overparameterized model would be

The interpretation is unchanged except that the influences of group membership on the A categorical predictor variable are represented by the b2, b3 and b4 coefficients in the regression equation

Y = b0 + b1X1 + b2X2 + b3X3 + b4X4

Separate Slope Designs. The traditional analysis of covariance (ANCOVA) design for categorical and continuous predictor variables is inappropriate when the categorical and continuous predictors interact in influencing responses on the outcome. The appropriate design for modeling the influences of the predictors in this situation is called the separate slope design. For the same example data used to illustrate traditional ANCOVA, the overparameterized X matrix for the design that includes the main effect of the three-level categorical predictor A and the 2-way interaction of P by A would be

The b4, b5, and b6 coefficients in the regression equation

Y = b0 + b1X1 + b2X2 + b3X3 + b4X4 + b5X5 + b6X6

give the separate slopes for the regression of the outcome on P within each group on A, controlling for the main effect of A.

As with nested ANOVA designs, the sigma-restricted coding of effects for separate slope designs is overly restrictive, so only the overparameterized model is used to represent separate slope designs. In fact, separate slope designs are identical in form to nested ANOVA designs, since the main effects for continuous predictors are omitted in separate slope designs.

Homogeneity of Slopes. The appropriate design for modeling the influences of continuous and categorical predictor variables depends on whether the continuous and categorical predictors interact in influencing the outcome. The traditional analysis of covariance (ANCOVA) design for continuous and categorical predictor variables is appropriate when the continuous and categorical predictors do not interact in influencing responses on the outcome, and the separate slope design is appropriate when the continuous and categorical predictors do interact in influencing responses. The homogeneity of slopes design can be used to test whether the continuous and categorical predictors interact in influencing responses, and thus, whether the traditional ANCOVA design or the separate slope design is appropriate for modeling the effects of the predictors. For the same example data used to illustrate the traditional ANCOVA and separate slope designs, the overparameterized X matrix for the design that includes the main effect of P, the main effect of the three-level categorical predictor A, and the 2-way interaction of P by A would be

If the b5, b6, or b7 coefficient in the regression equation

Y = b0 + b1X1 + b2X2 + b3X3 + b4X4 + b5X5 + b6X6 + b7X7

is non-zero, the separate slope model should be used. If instead all 3 of these regression coefficients are zero, the traditional ANCOVA design should be used.

The sigma-restricted X matrix for the homogeneity of slopes design would be

Using this X matrix, if the b4 or b5 coefficient in the regression equation

Y = b0 + b1X1 + b2X2 + b3X3 + b4X4 + b5X5

is non-zero, the separate slope model should be used. If instead both of these regression coefficients are zero, the traditional ANCOVA design should be used.
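The traditional ANCOVA, separate slope, and homogeneity of slopes designs differ only in which columns appear in X. The sketch below assembles all of them for hypothetical data (6 cases in 3 groups), not the example data of the text. The column orders are chosen to match the coefficient numbering in the regression equations above, which is an assumption about the original matrices.

```python
import numpy as np

# Hypothetical data: continuous predictor P and 3-level categorical predictor A.
P = np.array([2.0, 5.0, 3.0, 6.0, 1.0, 4.0])
A = np.array([1, 1, 2, 2, 3, 3])

ones = np.ones(len(P))
# Overparameterized coding of A: one 0/1 indicator column per level
ind = np.column_stack([(A == g).astype(float) for g in (1, 2, 3)])
# Sigma-restricted coding of A: 2 contrast columns, last level coded -1
con = ind[:, :2] - ind[:, [2]]
PA_over = P[:, None] * ind    # P by A interaction, overparameterized
PA_sigma = P[:, None] * con   # P by A interaction, sigma-restricted

X_ancova = np.column_stack([ones, P, ind])                # b0..b4
X_separate = np.column_stack([ones, ind, PA_over])        # b0..b6
X_homog_over = np.column_stack([ones, P, ind, PA_over])   # b0..b7
X_homog_sigma = np.column_stack([ones, P, con, PA_sigma]) # b0..b5

for name, X in [("ANCOVA", X_ancova), ("separate slope", X_separate),
                ("homogeneity (over)", X_homog_over),
                ("homogeneity (sigma)", X_homog_sigma)]:
    print(name, X.shape)
```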

Mixed Model ANOVA and ANCOVA. Designs that contain random effects for one or more categorical predictor variables are called mixed-model designs. Random effects are classification effects where the levels of the effects are assumed to be randomly selected from an infinite population of possible levels. The solution for the normal equations in mixed-model designs is identical to the solution for fixed-effect designs (i.e., designs which do not contain random effects). Mixed-model designs differ from fixed-effect designs only in the way in which effects are tested for significance. In fixed-effect designs, between effects are always tested using the mean squared residual as the error term. In mixed-model designs, between effects are tested using relevant error terms based on the covariation of random sources of variation in the design. Specifically, this is done using Satterthwaite’s method of denominator synthesis (Satterthwaite, 1946), which finds the linear combinations of sources of random variation that serve as appropriate error terms for testing the significance of the respective effect of interest. A basic discussion of these types of designs, and methods for estimating variance components for the random effects can also be found in the Variance Components and Mixed Model ANOVA/ANCOVA topic.

Mixed-model designs, like nested designs and separate slope designs, are designs in which the sigma-restricted coding of categorical predictors is overly restrictive. Mixed-model designs require estimation of the covariation between the levels of categorical predictor variables, and the sigma-restricted coding of categorical predictors suppresses this covariation. Thus, only the overparameterized model is used to represent mixed-model designs (some programs will use the sigma-restricted approach and a so-called “restricted model” for random effects; however, only the overparameterized model as described in General Linear Models applies to both balanced and unbalanced designs, as well as designs with missing cells; see Searle, Casella, & McCulloch, 1992, p. 127). It is important to recognize, however, that sigma-restricted coding can be used to represent any between design, with the exceptions of mixed-model, nested, and separate slope designs. Furthermore, some types of hypotheses can only be tested using the sigma-restricted coding (i.e., the effective hypothesis, Hocking, 1996); thus, the greater generality of the overparameterized model for representing between designs does not justify using it exclusively for representing categorical predictors in the general linear model.

### Within-Subject (Repeated Measures) Designs

Overview. It is quite common for researchers to administer the same test to the same subjects repeatedly over a period of time or under varying circumstances. In essence, we are interested in examining differences within each subject, for example, subjects’ improvement over time. Such designs are referred to as within-subject designs or repeated measures designs. A basic introduction to repeated measures designs is also provided in the Between-groups and repeated measures section of the ANOVA/MANOVA topic.

For example, imagine that we want to monitor the improvement of students’ algebra skills over two months of instruction. A standardized algebra test is administered after one month (level 1 of the repeated measures factor), and a comparable test is administered after two months (level 2 of the repeated measures factor). Thus, the repeated measures factor (Time) has 2 levels. Now, suppose that scores for the 2 algebra tests (i.e., values on the Y1 and Y2 variables at Time 1 and Time 2, respectively) are transformed into scores on a new composite variable (i.e., values on T1), using the linear transformation

T = YM

where M is an orthonormal contrast matrix. Specifically, if

then the difference of the mean score on T1 from 0 indicates the improvement (or deterioration) of scores across the 2 levels of Time.

One-Way Within-Subject Designs. The example algebra skills study with the Time repeated measures factor (see also within-subjects design Overview) illustrates a one-way within-subject design. In such designs, orthonormal contrast transformations of the scores on the original dependent Y variables are performed via the M transformation (orthonormal transformations correspond to orthogonal rotations of the original variable axes). If any b0 coefficient in the regression of a transformed T variable on the intercept is non-zero, this indicates a change in responses across the levels of the repeated measures factor, that is, the presence of a main effect for the repeated measure factor on responses.

What if the between design includes effects other than the intercept? If any of the b1 through bk coefficients in the regression of a transformed T variable on X are non-zero, this indicates a different change in responses across the levels of the repeated measures factor for different levels of the corresponding between effect, i.e., the presence of a within by between interaction effect on responses.

The same between-subject effects that can be tested in designs with no repeated-measures factors can also be tested in designs that do include repeated-measures factors. This is accomplished by creating a transformed dependent variable which is the sum of the original dependent variables divided by the square root of the number of original dependent variables. The same tests of between-subject effects that are performed in designs with no repeated-measures factors (including tests of the between intercept) are performed on this transformed dependent variable.
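The transformation T = YM for the two-level Time factor can be sketched as follows. The scores are hypothetical, and the sign of the contrast in `M_time` is one possible choice (Time 2 minus Time 1), since an orthonormal contrast is only determined up to sign.

```python
import numpy as np

# Hypothetical algebra scores: rows are students, columns are Time 1 and Time 2.
Y = np.array([[10.0, 14.0],
              [12.0, 13.0],
              [ 9.0, 15.0],
              [11.0, 12.0]])

# Orthonormal contrast for the within-subject effect of Time
M_time = np.array([[-1.0], [1.0]]) / np.sqrt(2)
T_time = Y @ M_time

# Between-subject effects: sum of the measures divided by sqrt(number of measures)
M_between = np.ones((2, 1)) / np.sqrt(2)
T_between = Y @ M_between

# With an intercept-only between design, b0 is simply the mean of T.
b0_time = T_time.mean()
print(b0_time)   # positive here, i.e., scores improved from Time 1 to Time 2
```

Whether a non-zero sample value of b0 reflects a real effect is a question for the significance tests, not for the transformation itself.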

Multi-Way Within-Subject Designs. Suppose that in the example algebra skills study with the Time repeated measures factor (see the within-subject designs Overview), students were given a number problem test and then a word problem test on each testing occasion. Test could then be considered as a second repeated measures factor, with scores on the number problem tests representing responses at level 1 of the Test repeated measure factor, and scores on the word problem tests representing responses at level 2 of the Test repeated measure factor. The within subject design for the study would be a 2 (Time) by 2 (Test) full-factorial design, with effects for Time, Test, and the Time by Test interaction.

To construct transformed dependent variables representing the effects of Time, Test, and the Time by Test interaction, three respective M transformations of the original dependent Y variables are performed. Assuming that the original Y variables are in the order Time 1 – Test 1, Time 1 – Test 2, Time 2 – Test 1, and Time 2 – Test 2, the M matrices for the Time, Test, and the Time by Test interaction would be

The differences of the mean scores on the transformed T variables from 0 are then used to interpret the corresponding within-subject effects. If the b0 coefficient in the regression of a transformed T variable on the intercept is non-zero, this indicates a change in responses across the levels of a repeated measures effect, that is, the presence of the corresponding main or interaction effect for the repeated measure factors on responses.
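One way to generate M matrices for a 2 (Time) by 2 (Test) within design is with Kronecker products of an averaging vector and a contrast vector. This is a sketch under the column order stated above (Time 1 – Test 1, Time 1 – Test 2, Time 2 – Test 1, Time 2 – Test 2); the signs are an arbitrary choice and may differ from the original matrices.

```python
import numpy as np

avg = np.array([1.0, 1.0]) / np.sqrt(2)    # averages over a 2-level factor
dif = np.array([1.0, -1.0]) / np.sqrt(2)   # contrasts the 2 levels of a factor

# Time varies slowest in the column order, so it is the first Kronecker factor.
M_time = np.kron(dif, avg)     # [ .5,  .5, -.5, -.5]
M_test = np.kron(avg, dif)     # [ .5, -.5,  .5, -.5]
M_inter = np.kron(dif, dif)    # [ .5, -.5, -.5,  .5]

M_all = np.column_stack([M_time, M_test, M_inter])
print(M_all)
print(np.allclose(M_all.T @ M_all, np.eye(3)))   # columns are orthonormal
```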

Interpretation of within by between interaction effects follows the same procedures as for one-way within designs, except that now within by between interactions are examined for each within effect by between effect combination.

Multivariate Approach to Repeated Measures. When the repeated measures factor has more than 2 levels, then the M matrix will have more than a single column. For example, for a repeated measures factor with 3 levels (e.g., Time 1, Time 2, Time 3), the M matrix will have 2 columns (e.g., the two transformations of the dependent variables could be (1) Time 1 vs. Time 2 and Time 3 combined, and (2) Time 2 vs. Time 3). Consequently, the nature of the design is really multivariate, that is, there are two simultaneous dependent variables, which are transformations of the original dependent variables. Therefore, when testing repeated measures effects involving more than a single degree of freedom (e.g., a repeated measures main effect with more than 2 levels), you can compute multivariate test statistics to test the respective hypotheses. This is a different (and usually the preferred) approach than the univariate method that is still widely used. For a further discussion of the multivariate approach to testing repeated measures effects, and a comparison to the traditional univariate approach, see the Sphericity and compound symmetry section of the ANOVA/MANOVA topic.

Doubly Multivariate Designs. If the product of the number of levels for each within-subject factor is equal to the number of original dependent variables, the within-subject design is called a univariate repeated measures design. The within design is univariate because there is one dependent variable representing each combination of levels of the within-subject factors. Note that this use of the term univariate design is not to be confused with the univariate and multivariate approach to the analysis of repeated measures designs, both of which can be used to analyze such univariate (single-dependent-variable-only) designs. When there are two or more dependent variables for each combination of levels of the within-subject factors, the within-subject design is called a multivariate repeated measures design, or more commonly, a doubly multivariate within-subject design. This term is used because the analysis for each dependent measure can be done via the multivariate approach; so when there is more than one dependent measure, the design can be considered doubly-multivariate.

Doubly multivariate designs are analyzed using a combination of univariate repeated measures and multivariate analysis techniques. To illustrate, suppose in an algebra skills study, tests are administered three times (repeated measures factor Time with 3 levels). Two test scores are recorded at each level of Time: a Number Problem score and a Word Problem score. Thus, scores on the two types of tests could be treated as multiple measures on which improvement (or deterioration) across Time could be assessed. M transformed variables could be computed for each set of test measures, and multivariate tests of significance could be performed on the multiple transformed measures, as well as on each individual test measure.

### Multivariate Designs

Overview. When there are multiple dependent variables in a design, the design is said to be multivariate. Multivariate measures of association are by nature more complex than their univariate counterparts (such as the correlation coefficient, for example). This is because multivariate measures of association must take into account not only the relationships of the predictor variables with responses on the dependent variables, but also the relationships among the multiple dependent variables. By doing so, however, these measures of association provide information about the strength of the relationships between predictor and dependent variables independent of the dependent variable interrelationships. A basic discussion of multivariate designs is also presented in the Multivariate Designs section in the ANOVA/MANOVA topic.

The most commonly used multivariate measures of association all can be expressed as functions of the eigenvalues of the product matrix

$E^{-1}H$

where E is the error SSCP matrix (i.e., the matrix of sums of squares and cross-products for the dependent variables that are not accounted for by the predictors in the between design), and H is a hypothesis SSCP matrix (i.e., the matrix of sums of squares and cross-products for the dependent variables that are accounted for by all the predictors in the between design, or the sums of squares and cross-products for the dependent variables that are accounted for by a particular effect). If

$\lambda_i$ = the ordered eigenvalues of $E^{-1}H$, if $E^{-1}$ exists

then the 4 commonly used multivariate measures of association are

Wilks’ lambda = $\prod_i [1/(1+\lambda_i)]$

Pillai’s trace = $\sum_i \lambda_i/(1+\lambda_i)$

Hotelling-Lawley trace = $\sum_i \lambda_i$

Roy’s largest root = $\lambda_1$

These 4 measures have different upper and lower bounds, with Wilks’ lambda perhaps being the most easily interpretable of the 4 measures. Wilks’ lambda can range from 0 to 1, with 1 indicating no relationship of predictors to responses and 0 indicating a perfect relationship of predictors to responses. 1 – Wilks’ lambda can be interpreted as the multivariate counterpart of a univariate R-squared, that is, it indicates the proportion of generalized variance in the dependent variables that is accounted for by the predictors.

The 4 measures of association are also used to construct multivariate tests of significance. These multivariate tests are covered in detail in a number of sources (e.g., Finn, 1974; Tatsuoka, 1971).

## Estimation and Hypothesis Testing

The following sections discuss details concerning hypothesis testing in the context of STATISTICA’s GLM module, for example, how the test for the overall model fit is computed, the options for computing tests for categorical effects in unbalanced or incomplete designs, how and when custom-error terms can be chosen, and the logic of testing custom-hypotheses in factorial or regression designs.

### Whole Model Tests

Partitioning Sums of Squares. A fundamental principle of least squares methods is that variation on a dependent variable can be partitioned, or divided into parts, according to the sources of the variation. Suppose that a dependent variable is regressed on one or more predictor variables, and that for convenience the dependent variable is scaled so that its mean is 0. Then a basic least squares identity is that the total sum of squared values on the dependent variable equals the sum of squared predicted values plus the sum of squared residual values. Stated more generally,

$$\sum (y - \bar{y})^2 = \sum (\hat{y} - \bar{y})^2 + \sum (y - \hat{y})^2$$

where the term on the left is the total sum of squared deviations of the observed values on the dependent variable from the dependent variable mean, and the respective terms on the right are (1) the sum of squared deviations of the predicted values for the dependent variable from the dependent variable mean and (2) the sum of the squared deviations of the observed values on the dependent variable from the predicted values, that is, the sum of the squared residuals. Stated yet another way,

Total SS = Model SS + Error SS

Note that the Total SS is always the same for any particular data set, but that the Model SS and the Error SS depend on the regression equation. Assuming again that the dependent variable is scaled so that its mean is 0, the Model SS and the Error SS can be computed using

$$\text{Model SS} = b'X'Y$$

$$\text{Error SS} = Y'Y - b'X'Y$$
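The partition can be verified numerically. The following is a minimal sketch, assuming NumPy and simulated data with two hypothetical predictors; it computes the three sums of squares from their definitions and then repeats the Model SS and Error SS computation in matrix form after centering the dependent variable at 0.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 30
x = rng.normal(size=(n, 2))                      # two hypothetical predictors
y = 1.0 + x @ np.array([0.8, -0.5]) + rng.normal(size=n)

X = np.column_stack([np.ones(n), x])             # design matrix with intercept
b, *_ = np.linalg.lstsq(X, y, rcond=None)        # least squares coefficients
y_hat = X @ b

total_ss = np.sum((y - y.mean()) ** 2)
model_ss = np.sum((y_hat - y.mean()) ** 2)
error_ss = np.sum((y - y_hat) ** 2)

# Matrix form, with the dependent variable scaled so that its mean is 0.
yc = y - y.mean()
bc, *_ = np.linalg.lstsq(X, yc, rcond=None)
model_ss_matrix = bc @ X.T @ yc                  # b'X'Y
error_ss_matrix = yc @ yc - model_ss_matrix      # Y'Y - b'X'Y

print(total_ss, model_ss + error_ss)
```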

Testing the Whole Model. Given the Model SS and the Error SS, we can perform a test that all the regression coefficients for the X variables ($b_1$ through $b_k$) are zero. This test is equivalent to a comparison of the fit of the regression surface defined by the predicted values (computed from the whole model regression equation) to the fit of the regression surface defined solely by the dependent variable mean (computed from the reduced regression equation containing only the intercept). Assuming that X’X is full-rank, the whole model hypothesis mean square

$$MSH = \frac{\text{Model SS}}{k}$$

is an estimate of the variance of the predicted values. The error mean square

$$s^2 = MSE = \frac{\text{Error SS}}{n-k-1}$$

is an unbiased estimate of the residual or error variance. The test statistic is

$$F = \frac{MSH}{MSE}$$

where F has (k, n – k – 1) degrees of freedom.

If X’X is not full rank, r – 1 is substituted for k, where r is the rank or the number of non-redundant columns of X’X (including the intercept column), so that the error degrees of freedom become n – r.
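A small sketch of the whole model test for the full-rank case follows. It assumes NumPy, and the helper name `whole_model_f` and the simulated data are hypothetical. The p-value is omitted; it could be obtained from an F distribution with the returned degrees of freedom (for example with `scipy.stats.f.sf`, if SciPy is installed).

```python
import numpy as np

def whole_model_f(X, y):
    """X holds the k predictor columns only (no intercept column).

    Returns (F, df_model, df_error), assuming the design is full rank.
    """
    n, k = X.shape
    Xd = np.column_stack([np.ones(n), X])        # add the intercept column
    b, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    y_hat = Xd @ b
    model_ss = np.sum((y_hat - y.mean()) ** 2)
    error_ss = np.sum((y - y_hat) ** 2)
    msh = model_ss / k                           # hypothesis mean square
    mse = error_ss / (n - k - 1)                 # error mean square
    return msh / mse, k, n - k - 1

# Hypothetical data: 25 cases, 3 predictors.
rng = np.random.default_rng(6)
X = rng.normal(size=(25, 3))
y = 0.5 + X @ np.array([1.0, 0.0, -0.7]) + rng.normal(size=25)

F, df1, df2 = whole_model_f(X, y)
print(F, df1, df2)
```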

Note that in the case of non-intercept models, some multiple regression programs will compute the full model test based on the proportion of variance around 0 (zero) accounted for by the predictors (for more information, see Kvålseth, 1985; Okunade, Chang, and Evans, 1993), while others will actually compute both values (i.e., based on the residual variance around 0, and around the respective dependent variable means).

Limitations of Whole Model Tests. For designs such as one-way ANOVA or simple regression designs, the whole model test by itself may be sufficient for testing general hypotheses about whether or not the single predictor variable is related to the outcome. In more complex designs, however, hypotheses about specific X variables or subsets of X variables are usually of interest. For example, you might want to make inferences about whether a subset of regression coefficients are 0, or you might want to test whether subpopulation means corresponding to combinations of specific X variables differ. The whole model test is usually insufficient for such purposes.

A variety of methods have been developed for testing specific hypotheses. Like whole model tests, many of these methods rely on comparisons of the fit of different models (e.g., Type I, Type II, and the effective hypothesis sums of squares). Other methods construct tests of linear combinations of regression coefficients in order to test mean differences (e.g., Type III, Type IV, and Type V sums of squares). For designs that contain only first-order effects of continuous predictor variables (i.e., multiple regression designs), many of these methods are equivalent (i.e., Type II through Type V sums of squares all test the significance of partial regression coefficients). However, there are important distinctions between the different hypothesis testing techniques for certain types of ANOVA designs (i.e., designs with unequal cell n’s and/or missing cells).

All methods for testing hypotheses, however, involve the same hypothesis testing strategy employed in whole model tests, that is, the sums of squares attributable to an effect (using a given criterion) is computed, and then the mean square for the effect is tested using an appropriate error term.

When there are categorical predictors in the model, arranged in a factorial ANOVA design, then we are typically interested in the main effects for and interaction effects between the categorical predictors. However, when the design is not balanced (has unequal cell n’s, and consequently, the coded effects for the categorical factors are usually correlated), or when there are missing cells in a full factorial ANOVA design, then there is ambiguity regarding the specific comparisons between the (population, or least-squares) cell means that constitute the main effects and interactions of interest. These issues are discussed in great detail in Milliken and Johnson (1986), and if you routinely analyze incomplete factorial designs, you should consult their discussion of various problems and approaches to solving them.

In addition to the widely used methods that are commonly labeled Type I, II, III, and IV sums of squares (see Goodnight, 1980), we also offer different methods for testing effects in incomplete designs, that are widely used in other areas (and traditions) of research.

Type V sums of squares. Specifically, we propose the term Type V sums of squares to denote the approach that is widely used in industrial experimentation, to analyze fractional factorial designs; these types of designs are discussed in detail in the 2**(k-p) Fractional Factorial Designs section of the Experimental Design topic. In effect, for those effects for which tests are performed all population marginal means (least squares means) are estimable.

Type VI sums of squares. Second, in keeping with the Type i labeling convention, we propose the term Type VI sums of squares to denote the approach that is often used in programs that only implement the sigma-restricted model (which is not well suited for certain types of designs; we offer a choice between the sigma-restricted and overparameterized models). This approach is identical to what is described as the effective hypothesis method in Hocking (1996).

Contained Effects. The following descriptions will use the term contained effect. An effect E1 (e.g., A * B interaction) is contained in another effect E2 if:

• Both effects involve the same continuous predictor variable (if included in the model; e.g., A * B * X would be contained in A * C * X, where A, B, and C are categorical predictors, and X is a continuous predictor); or
• E2 has more categorical predictors than does E1, and, if E1 includes any categorical predictors, they also appear in E2 (e.g., A * B would be contained in the A * B * C interaction).

Type I Sums of Squares. Type I sums of squares involve a sequential partitioning of the whole model sums of squares. A hierarchical series of regression equations is estimated, at each step adding an additional effect into the model. In Type I sums of squares, the sums of squares for each effect are determined by subtracting the predicted sums of squares for the preceding model not including the effect from the predicted sums of squares with the effect in the model. Tests of significance for each effect are then performed on the increment in the predicted sums of squares accounted for by the effect. Type I sums of squares are therefore sometimes called sequential or hierarchical sums of squares.

Type I sums of squares are appropriate to use in balanced (equal n) ANOVA designs in which effects are entered into the model in their natural order (i.e., any main effects are entered before any two-way interaction effects, any two-way interaction effects are entered before any three-way interaction effects, and so on). Type I sums of squares are also useful in polynomial regression designs in which any lower-order effects are entered before any higher-order effects. A third use of Type I sums of squares is to test hypotheses for hierarchically nested designs, in which the first effect in the design is nested within the second effect, the second effect is nested within the third, and so on.

One important property of Type I sums of squares is that the sums of squares attributable to each effect add up to the whole model sums of squares. Thus, Type I sums of squares provide a complete decomposition of the predicted sums of squares for the whole model. This is not generally true for any other type of sums of squares. An important limitation of Type I sums of squares, however, is that the sums of squares attributable to a specific effect will generally depend on the order in which the effects are entered into the model. This lack of invariance to order of entry into the model limits the usefulness of Type I sums of squares for testing hypotheses for certain designs (e.g., fractional factorial designs).
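Both properties, the complete decomposition and the dependence on order of entry, can be seen in a short numerical sketch. It assumes NumPy; the helper names `model_ss` and `type1_ss` and the two correlated predictors are hypothetical.

```python
import numpy as np

def model_ss(X, y):
    """Predicted (model) sum of squares around the mean for design matrix X."""
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    return float(np.sum((X @ b - y.mean()) ** 2))

def type1_ss(effects, y):
    """effects: ordered list of (name, columns); returns sequential SS per effect."""
    n = len(y)
    X = np.ones((n, 1))                  # start from the intercept-only model
    previous = model_ss(X, y)            # 0 for the intercept-only model
    out = {}
    for name, cols in effects:
        X = np.column_stack([X, cols])   # add the next effect
        current = model_ss(X, y)
        out[name] = current - previous   # increment due to this effect
        previous = current
    return out

rng = np.random.default_rng(2)
n = 40
x1 = rng.normal(size=n)
x2 = 0.6 * x1 + rng.normal(size=n)       # correlated predictors (hypothetical)
y = 1.0 + 0.7 * x1 + 0.4 * x2 + rng.normal(size=n)

ss_12 = type1_ss([("x1", x1), ("x2", x2)], y)   # x1 entered first
ss_21 = type1_ss([("x2", x2), ("x1", x1)], y)   # x2 entered first
whole = model_ss(np.column_stack([np.ones(n), x1, x2]), y)

print(ss_12, ss_21, whole)
```

In both orders the increments add up to the whole model sum of squares, but the amount credited to each predictor changes with its position.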

Type II Sums of Squares. Type II sums of squares are sometimes called partially sequential sums of squares. Like Type I sums of squares, Type II sums of squares for an effect control for the influence of other effects. Which other effects to control for, however, is determined by a different criterion. In Type II sums of squares, the sums of squares for an effect are computed by controlling for the influence of all other effects of equal or lower degree. Thus, sums of squares for main effects control for all other main effects, sums of squares for two-way interactions control for all main effects and all other two-way interactions, and so on.

Unlike Type I sums of squares, Type II sums of squares are invariant to the order in which effects are entered into the model. This makes Type II sums of squares useful for testing hypotheses for multiple regression designs, for main effect ANOVA designs, for full-factorial ANOVA designs with equal cell ns, and for hierarchically nested designs.

There is a drawback to the use of Type II sums of squares for factorial designs with unequal cell ns. In these situations, Type II sums of squares test hypotheses that are complex functions of the cell ns that ordinarily are not meaningful. Thus, a different method for testing hypotheses is usually preferred.

Type III Sums of Squares. Type I and Type II sums of squares usually are not appropriate for testing hypotheses for factorial ANOVA designs with unequal ns. For ANOVA designs with unequal ns, however, Type III sums of squares test the same hypothesis that would be tested if the cell ns were equal, provided that there is at least one observation in every cell. Specifically, in no-missing-cell designs, Type III sums of squares test hypotheses about differences in subpopulation (or marginal) means. When there are no missing cells in the design, these subpopulation means are least squares means, which are the best linear-unbiased estimates of the marginal means for the design (see, Milliken and Johnson, 1986).

Tests of differences in least squares means have the important property that they are invariant to the choice of the coding of effects for categorical predictor variables (e.g., the use of the sigma-restricted or overparameterized model) and to the choice of the particular g2 inverse of X’X used to solve the normal equations. Thus, tests of linear combinations of least squares means in general, including Type III tests of differences in least squares means, are said to not depend on the parameterization of the design. This makes Type III sums of squares useful for testing hypotheses for any design for which Type I or Type II sums of squares are appropriate, as well as for any unbalanced ANOVA design with no missing cells.

The Type III sums of squares attributable to an effect is computed as the sums of squares for the effect controlling for any effects of equal or lower degree and orthogonal to any higher-order interaction effects (if any) that contain it. The orthogonality to higher-order containing interactions is what gives Type III sums of squares the desirable properties associated with linear combinations of least squares means in ANOVA designs with no missing cells. But for ANOVA designs with missing cells, Type III sums of squares generally do not test hypotheses about least squares means, but instead test hypotheses that are complex functions of the patterns of missing cells in higher-order containing interactions and that are ordinarily not meaningful. In this situation Type V sums of squares or tests of the effective hypothesis (Type VI sums of squares) are preferred.

Type IV Sums of Squares. Type IV sums of squares were designed to test “balanced” hypotheses for lower-order effects in ANOVA designs with missing cells. Type IV sums of squares are computed by equitably distributing cell contrast coefficients for lower-order effects across the levels of higher-order containing interactions.

Type IV sums of squares are not recommended for testing hypotheses for lower-order effects in ANOVA designs with missing cells, even though this is the purpose for which they were developed. This is because Type IV sum-of-squares are invariant to some but not all g2 inverses of X’X that could be used to solve the normal equations. Specifically, Type IV sums of squares are invariant to the choice of a g2 inverse of X’X given a particular ordering of the levels of the categorical predictor variables, but are not invariant to different orderings of levels. Furthermore, as with Type III sums of squares, Type IV sums of squares test hypotheses that are complex functions of the patterns of missing cells in higher-order containing interactions and that are ordinarily not meaningful.

Statisticians who have examined the usefulness of Type IV sums of squares have concluded that Type IV sums of squares are not up to the task for which they were developed:

• Milliken & Johnson (1992, p. 204) write: “It seems likely that few, if any, of the hypotheses tested by the Type IV analysis of [some programs] will be of particular interest to the experimenter.”
• Searle (1987, p. 463-464) writes: “In general, [Type IV] hypotheses determined in this nature are not necessarily of any interest.”; and (p. 465) “This characteristic of Type IV sums of squares for rows depending on the sequence of rows establishes their non-uniqueness, and this in turn emphasizes that the hypotheses they are testing are by no means necessarily of any general interest.”
• Hocking (1985, p. 152), in an otherwise comprehensive introduction to general linear models, writes: “For the missing cell problem, [some programs] offers a fourth analysis, Type IV, which we shall not discuss.”

So, we recommend that you use the Type IV sums of squares solution with caution, and that you understand fully the nature of the (often non-unique) hypotheses that are being tested, before attempting interpretations of the results. Furthermore, in ANOVA designs with no missing cells, Type IV sums of squares are always equal to Type III sums of squares, so the use of Type IV sums of squares is either (potentially) inappropriate, or unnecessary, depending on the presence of missing cells in the design.

Type V Sums of Squares. Type V sums of squares were developed as an alternative to Type IV sums of squares for testing hypotheses in ANOVA designs with missing cells. Also, this approach is widely used in industrial experimentation, to analyze fractional factorial designs; these types of designs are discussed in detail in the 2**(k-p) Fractional Factorial Designs section of the Experimental Design topic. In effect, for effects for which tests are performed all population marginal means (least squares means) are estimable.

Type V sums of squares involve a combination of the methods employed in computing Type I and Type III sums of squares. Specifically, whether or not an effect is eligible to be dropped from the model is determined using Type I procedures, and then hypotheses are tested for effects not dropped from the model using Type III procedures. Type V sums of squares can be illustrated by using a simple example. Suppose that the effects considered are A, B, and A by B, in that order, and that A and B are both categorical predictors with, say, 3 and 2 levels, respectively. The intercept is first entered into the model. Then A is entered into the model, and its degrees of freedom are determined (i.e., the number of non-redundant columns for A in X’X, given the intercept). If A’s degrees of freedom are less than 2 (i.e., its number of levels minus 1), it is eligible to be dropped. Then B is entered into the model, and its degrees of freedom are determined (i.e., the number of non-redundant columns for B in X’X, given the intercept and A). If B’s degrees of freedom are less than 1 (i.e., its number of levels minus 1), it is eligible to be dropped. Finally, A by B is entered into the model, and its degrees of freedom are determined (i.e., the number of non-redundant columns for A by B in X’X, given the intercept, A, and B). If A by B’s degrees of freedom are less than 2 (i.e., the product of the degrees of freedom for its factors if there were no missing cells), it is eligible to be dropped. Type III sums of squares are then computed for the effects that were not found to be eligible to be dropped, using the reduced model in which any eligible effects are dropped. Tests of significance, however, use the error term for the whole model prior to dropping any eligible effects.
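The eligibility step of this example can be sketched by counting the rank each effect adds when entered in order. This is only an illustration of the sequential degrees-of-freedom check, not of the subsequent Type III computation; it assumes NumPy, and the helper names `indicator` and `sequential_df` and the layout with one missing cell are hypothetical.

```python
import numpy as np

def indicator(levels):
    """Overparameterized coding: one 0/1 column per observed level."""
    uniq = sorted(set(levels))
    return np.array([[1.0 if v == u else 0.0 for u in uniq] for v in levels])

def sequential_df(effects, n):
    """Rank gained by each effect when entered in order after the intercept."""
    X = np.ones((n, 1))
    rank = np.linalg.matrix_rank(X)
    df = {}
    for name, cols in effects:
        X = np.column_stack([X, cols])
        new_rank = np.linalg.matrix_rank(X)
        df[name] = int(new_rank - rank)   # non-redundant columns for this effect
        rank = new_rank
    return df

# Hypothetical 3 x 2 layout with cell (a3, b2) missing, 2 cases per observed cell.
cells = [("a1", "b1"), ("a1", "b2"), ("a2", "b1"), ("a2", "b2"), ("a3", "b1")]
a = [c[0] for c in cells for _ in range(2)]
b = [c[1] for c in cells for _ in range(2)]
ab = [f"{u}:{v}" for u, v in zip(a, b)]

df = sequential_df(
    [("A", indicator(a)), ("B", indicator(b)), ("A*B", indicator(ab))], len(a)
)
full_df = {"A": 2, "B": 1, "A*B": 2}      # df if no cells were missing
eligible = [name for name in df if df[name] < full_df[name]]

print(df, eligible)
```

With this pattern of observed cells, A and B keep their full degrees of freedom, while A by B gains only one and is therefore eligible to be dropped.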

Note that Type V sums of squares involve determining a reduced model for which all effects remaining in the model have at least as many degrees of freedom as they would have if there were no missing cells. This is equivalent to finding a subdesign with no missing cells such that the Type III sums of squares for all effects in the subdesign reflect differences in least squares means.

Appropriate caution should be exercised when using Type V sums of squares. Dropping an effect from a model is the same as assuming that the effect is unrelated to the outcome (see, e.g., Hocking, 1996). The reasonableness of the assumption does not necessarily ensure its validity, so when possible the relationships of dropped effects to the outcome should be inspected. It is also important to note that Type V sums of squares are not invariant to the order in which eligibility for dropping effects from the model is evaluated. Different orders of effects could produce different reduced models.

In spite of these limitations, Type V sums of squares for the reduced model have all the same properties as Type III sums of squares for ANOVA designs with no missing cells. Even in designs with many missing cells (such as fractional factorial designs, in which many high-order interaction effects are assumed to be zero), Type V sums of squares provide tests of meaningful hypotheses, and sometimes hypotheses that cannot be tested using any other method.

Type VI (Effective Hypothesis) Sums of Squares. Type I through Type V sums of squares can all be viewed as providing tests of hypotheses that subsets of partial regression coefficients (controlling for or orthogonal to appropriate additional effects) are zero. Effective hypothesis tests (developed by Hocking, 1996) are based on the philosophy that the only unambiguous estimate of an effect is the proportion of variability on the outcome that is uniquely attributable to the effect. The overparameterized coding of effects for categorical predictor variables generally cannot be used to provide such unique estimates for lower-order effects. Effective hypothesis tests, which we propose to call Type VI sums of squares, use the sigma-restricted coding of effects for categorical predictor variables to provide unique effect estimates even for lower-order effects.

The method for computing Type VI sums of squares is straightforward. The sigma-restricted coding of effects is used, and for each effect, its Type VI sums of squares is the difference of the model sums of squares for all other effects from the whole model sums of squares. As such, the Type VI sums of squares provide an unambiguous estimate of the variability of predicted values for the outcome uniquely attributable to each effect.
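The computation just described can be sketched for a small unbalanced 2 x 2 design. The code assumes NumPy; the helper names `model_ss` and `type6_ss`, the +1/-1 coding of the two-level factors, and the simulated outcome are all illustrative assumptions.

```python
import numpy as np

def model_ss(X, y):
    """Predicted (model) sum of squares around the mean for design matrix X."""
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    return float(np.sum((X @ b - y.mean()) ** 2))

def type6_ss(effects, y):
    """effects: dict of name -> sigma-restricted coded columns (2-D arrays)."""
    ones = np.ones((len(y), 1))
    whole = model_ss(np.column_stack([ones] + list(effects.values())), y)
    out = {}
    for name in effects:
        # Model containing every effect except the one being tested.
        others = [cols for other, cols in effects.items() if other != name]
        out[name] = whole - model_ss(np.column_stack([ones] + others), y)
    return out

# Hypothetical unbalanced 2 x 2 design (cell sizes 3, 3, 2, 2).
# Sigma-restricted coding of a two-level factor is a single +1/-1 column.
a = np.array([1, 1, 1, 1, 1, -1, -1, -1, 1, -1], dtype=float)
b = np.array([1, 1, -1, -1, -1, 1, 1, -1, 1, -1], dtype=float)
rng = np.random.default_rng(3)
y = 2.0 + 0.5 * a - 0.3 * b + 0.2 * a * b + rng.normal(size=a.size)

effects = {"A": a[:, None], "B": b[:, None], "A*B": (a * b)[:, None]}
ss = type6_ss(effects, y)
print(ss)
```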

In ANOVA designs with missing cells, Type VI sums of squares for effects can have fewer degrees of freedom than they would have if there were no missing cells, and for some missing cell designs, can even have zero degrees of freedom. The philosophy of Type VI sums of squares is to test as much as possible of the original hypothesis given the observed cells. If the pattern of missing cells is such that no part of the original hypothesis can be tested, so be it. The inability to test hypotheses is simply the price we pay for having no observations at some combinations of the levels of the categorical predictor variables. The philosophy is that it is better to admit that a hypothesis cannot be tested than it is to test a distorted hypothesis that may not meaningfully reflect the original hypothesis.

Type VI sums of squares cannot generally be used to test hypotheses for nested ANOVA designs, separate slope designs, or mixed-model designs, because the sigma-restricted coding of effects for categorical predictor variables is overly restrictive in such designs. This limitation, however, does not diminish the fact that Type VI sums of squares can b

### Error Terms for Tests

Lack-of-Fit Tests using Pure Error. Whole model tests and tests based on the 6 types of sums of squares use the mean square residual as the error term for tests of significance. For certain types of designs, however, the residual sum of squares can be further partitioned into meaningful parts which are relevant for testing hypotheses. One such type of design is a simple regression design in which there are subsets of cases all having the same values on the predictor variable. For example, performance on a task could be measured for subjects who work on the task under several different room temperature conditions. The test of significance for the Temperature effect in the linear regression of Performance on Temperature would not necessarily provide complete information on how Temperature relates to Performance; the regression coefficient for Temperature only reflects its linear effect on the outcome.

One way to glean additional information from this type of design is to partition the residual sums of squares into lack-of-fit and pure error components. In the example just described, this would involve determining the difference between the sum of squares that cannot be predicted by Temperature levels, given the linear effect of Temperature (residual sums of squares) and the pure error; this difference would be the sums of squares associated with the lack-of-fit (in this example, of the linear model). The test of lack-of-fit, using the mean square pure error as the error term, would indicate whether non-linear effects of Temperature are needed to adequately model Temperature’s influence on the outcome. Further, the linear effect could be tested using the pure error term, thus providing a more sensitive test of the linear effect independent of any possible nonlinear effect.
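The partition of the residual sum of squares can be sketched as follows for the Temperature example. The code assumes NumPy; the function name `lack_of_fit`, the four temperature levels, and the simulated (deliberately curved) performance scores are hypothetical.

```python
import numpy as np

def lack_of_fit(x, y):
    """Partition the residual SS of a simple linear regression of y on x."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    X = np.column_stack([np.ones_like(x), x])
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    residual_ss = float(np.sum((y - X @ b) ** 2))
    levels = np.unique(x)
    # Pure error: variation of y around its own mean within each x level.
    pure_ss = float(sum(np.sum((y[x == v] - y[x == v].mean()) ** 2)
                        for v in levels))
    lof_ss = residual_ss - pure_ss
    df_pure = len(y) - len(levels)
    df_lof = len(levels) - 2          # levels minus the 2 fitted coefficients
    f_lof = (lof_ss / df_lof) / (pure_ss / df_pure)
    return {"residual_ss": residual_ss, "pure_error_ss": pure_ss,
            "lack_of_fit_ss": lof_ss, "df": (df_lof, df_pure), "F": f_lof}

# Hypothetical data: performance at 4 room temperatures, 3 subjects each,
# generated with a curved (quadratic) trend.
temperature = np.repeat([18.0, 21.0, 24.0, 27.0], 3)
rng = np.random.default_rng(4)
performance = 50 - 0.5 * (temperature - 22.0) ** 2 + rng.normal(size=12)

result = lack_of_fit(temperature, performance)
print(result)
```

Because the simulated trend is curved, the lack-of-fit mean square should be large relative to the pure error mean square, which is the situation in which a non-linear term for Temperature would be considered.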

Designs with Zero Degrees of Freedom for Error. When the model degrees of freedom equal the number of cases or subjects, the residual sums of squares will have zero degrees of freedom and preclude the use of standard hypothesis tests. This sometimes occurs for overfitted designs (designs with many predictors, or designs with categorical predictors having many levels). However, in some designed experiments, such as experiments using split-plot designs or highly fractionalized factorial designs as commonly used in industrial experimentation, it is no accident that the residual sum of squares has zero degrees of freedom. In such experiments, mean squares for certain effects are planned to be used as error terms for testing other effects, and the experiment is designed with this in mind. It is entirely appropriate to use alternatives to the mean square residual as error terms for testing hypotheses in such designs.

Tests in Mixed Model Designs. Designs which contain random effects for one or more categorical predictor variables are called mixed-model designs. These types of designs, and the analysis of those designs, are also described in detail in the Variance Components and Mixed Model ANOVA/ANCOVA topic. Random effects are classification effects where the levels of the effects are assumed to be randomly selected from an infinite population of possible levels. The solution for the normal equations in mixed-model designs is identical to the solution for fixed-effect designs (i.e., designs which do not contain random effects). Mixed-model designs differ from fixed-effect designs only in the way in which effects are tested for significance. In fixed-effect designs, between effects are always tested using the mean square residual as the error term. In mixed-model designs, between effects are tested using relevant error terms based on the covariation of sources of variation in the design. Also, only the overparameterized model is used to code effects for categorical predictors in mixed-models, because the sigma-restricted model is overly restrictive.

The covariation of sources of variation in the design is estimated by the elements of a matrix called the Expected Mean Squares (EMS) matrix. This non-square matrix contains elements for the covariation of each combination of pairs of sources of variation and for each source of variation with Error. Specifically, each element is the mean square for one effect (indicated by the column) that is expected to be accounted for by another effect (indicated by the row), given the observed covariation in their levels. Note that expected mean squares can be computed using any type of sums of squares from Type I through Type V. Once the EMS matrix is computed, it is used to solve for the linear combinations of sources of random variation that are appropriate to use as error terms for testing the significance of the respective effects. This is done using Satterthwaite’s method of denominator synthesis (Satterthwaite, 1946). Detailed discussions of methods for testing effects in mixed-models, and related methods for estimating variance components for random effects, can be found in the Variance Components and Mixed Model ANOVA/ANCOVA topic.

### Testing Specific Hypotheses

Whole model tests and tests based on sums of squares attributable to specific effects illustrate two general types of hypotheses that can be tested using the general linear model. Still, there may be other types of hypotheses the researcher wishes to test that do not fall into either of these categories. For example, hypotheses about subsets of effects may be of interest, or hypotheses involving comparisons of specific levels of categorical predictor variables may be of interest.

Estimability of Hypotheses. Before considering tests of specific hypotheses of this sort, it is important to address the issue of estimability. A test of a specific hypothesis using the general linear model must be framed in terms of the regression coefficients for the solution of the normal equations. If the X’X matrix is less than full rank, the regression coefficients depend on the particular g2 inverse used for solving the normal equations, and the regression coefficients will not be unique. When the regression coefficients are not unique, linear functions (f) of the regression coefficients having the form

f = Lb

where L is a vector of coefficients, will also in general not be unique. However, Lb for an L which satisfies

$$L = L(X'X)^{-}X'X$$

is invariant for all possible g2 inverses, and is therefore called an estimable function.
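The estimability condition can be checked numerically. The sketch below assumes NumPy and uses the Moore-Penrose inverse from `numpy.linalg.pinv` as one convenient choice of generalized inverse; the function name `is_estimable` and the one-way layout are hypothetical.

```python
import numpy as np

def is_estimable(L, X, tol=1e-8):
    """Check L = L (X'X)^- X'X, using the Moore-Penrose inverse as (X'X)^-."""
    XtX = X.T @ X
    L = np.atleast_2d(np.asarray(L, float))
    return bool(np.allclose(L @ np.linalg.pinv(XtX) @ XtX, L, atol=tol))

# Hypothetical one-way layout with 3 groups of 4 cases in overparameterized
# coding: columns are intercept, group 1, group 2, group 3 (X'X is rank 3).
g = np.repeat([0, 1, 2], 4)
X = np.column_stack([np.ones(12), np.eye(3)[g]])

diff_12 = [0, 1, -1, 0]   # difference between the group 1 and group 2 effects
group_1 = [0, 1, 0, 0]    # the group 1 effect on its own
mean_1 = [1, 1, 0, 0]     # intercept plus group 1 effect (the group 1 mean)

for name, L in [("diff_12", diff_12), ("group_1", group_1), ("mean_1", mean_1)]:
    print(name, is_estimable(L, X))
```

The group difference and the group mean are linear combinations of the rows of X and pass the check; the isolated group effect is not and fails it.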

The theory of estimability of linear functions is an advanced topic in the theory of algebraic invariants (Searle, 1987, provides a comprehensive introduction), but its implications are clear enough. One instance of non-estimability of a hypothesis has been encountered in tests of the effective hypothesis which have zero degrees of freedom. On the other hand, Type III sums of squares for categorical predictor variable effects in ANOVA designs with no missing cells (and the least squares means in such designs) provide an example of estimable functions which do not depend on the model parameterization (i.e., the particular g2 inverse used to solve the normal equations). The general implication of the theory of estimability of linear functions is that hypotheses which cannot be expressed as linear combinations of the rows of X (i.e., the combinations of observed levels of the categorical predictor variables) are not estimable, and therefore cannot be tested. Stated another way, we simply cannot test specific hypotheses that are not represented in the data. The notion of estimability is valuable because the test for estimability makes explicit which specific hypotheses can be tested and which cannot.

Linear Combinations of Effects. In multiple regression designs, it is common for hypotheses of interest to involve subsets of effects. In mixture designs, for example, we might be interested in simultaneously testing whether the main effect and any of the two-way interactions involving a particular predictor variable are non-zero. It is also common in multiple regression designs for hypotheses of interest to involve comparisons of slopes. For example, we might be interested in whether the regression coefficients for two predictor variables differ. In both factorial regression and factorial ANOVA designs with many factors, it is often of interest whether sets of effects, say, all three-way and higher-order interactions, are nonzero. Tests of these types of specific hypotheses involve (1) constructing one or more Ls reflecting the hypothesis, (2) testing the estimability of the hypothesis by determining whether

$$L = L(X'X)^{-}X'X$$

and if so, using (3)

$$(Lb)'\left(L(X'X)^{-}L'\right)^{-1}(Lb)$$

to estimate the sums of squares accounted for by the hypothesis. Finally, (4) the hypothesis is tested for significance using the usual mean square residual as the error term. To illustrate this 4-step procedure, suppose that a test of the difference in the regression slopes is desired for the (intercept plus) 2 predictor variables in a first-order multiple regression design. The coefficients for L would be

L = [0 1 -1]

(note that the first coefficient 0 excludes the intercept from the comparison) for which Lb is estimable if the 2 predictor variables are not redundant with each other. The hypothesis sums of squares reflect the difference in the partial regression coefficients for the 2 predictor variables, which is tested for significance using the mean square residual as the error term.
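Steps (3) and (4) for this slope comparison can be sketched as follows. The code assumes NumPy, uses `numpy.linalg.pinv` as the generalized inverse of X’X, and relies on simulated data; the function name `hypothesis_test` is hypothetical, and the rows of L are assumed to be linearly independent and estimable.

```python
import numpy as np

def hypothesis_test(L, X, y):
    """SS and F for the hypothesis L b = 0, using the mean square residual."""
    L = np.atleast_2d(np.asarray(L, float))
    XtX_ginv = np.linalg.pinv(X.T @ X)            # a generalized inverse of X'X
    b = XtX_ginv @ X.T @ y                        # solution of the normal equations
    Lb = L @ b
    # (Lb)' (L (X'X)^- L')^{-1} (Lb)
    ss_h = float(Lb @ np.linalg.inv(L @ XtX_ginv @ L.T) @ Lb)
    df_h = L.shape[0]                             # one df per independent row of L
    df_e = int(len(y) - np.linalg.matrix_rank(X))
    mse = float(np.sum((y - X @ b) ** 2)) / df_e  # mean square residual
    return ss_h, (ss_h / df_h) / mse, (df_h, df_e)

# Hypothetical first-order design: intercept plus 2 predictors.
rng = np.random.default_rng(5)
n = 50
x1, x2 = rng.normal(size=n), rng.normal(size=n)
y = 1.0 + 0.9 * x1 + 0.2 * x2 + rng.normal(size=n)
X = np.column_stack([np.ones(n), x1, x2])

ss_h, f_stat, df = hypothesis_test([0, 1, -1], X, y)
print(ss_h, f_stat, df)
```

The same sum of squares is obtained as the increase in the residual sum of squares when the two slopes are constrained to be equal, that is, when the outcome is regressed on the single predictor x1 + x2.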

Planned Comparisons of Least Square Means. Usually, experimental hypotheses are stated in terms that are more specific than simply main effects or interactions. We may have the specific hypothesis that a particular textbook will improve math skills in males, but not in females, while another book would be about equally effective for both genders, but less effective overall for males. Now generally, we are predicting an interaction here: the effectiveness of the book is modified (qualified) by the student’s gender. However, we have a particular prediction concerning the nature of the interaction: we expect a significant difference between genders for one book, but not the other. This type of specific prediction is usually tested by testing planned comparisons of least squares means (estimates of the population marginal means), or as it is sometimes called, contrast analysis.

Briefly, contrast analysis allows us to test the statistical significance of predicted specific differences in particular parts of our complex design. The 4-step procedure for testing specific hypotheses is used to specify and test specific predictions. Contrast analysis is a major and indispensable component of the analysis of many complex experimental designs.

To learn more about the logic and interpretation of contrast analysis refer to the ANOVA/MANOVA topic Overview section.

Post-Hoc Comparisons. Sometimes we find effects in an experiment that were not expected. Even though in most cases a creative experimenter will be able to explain almost any pattern of means, it would not be appropriate to analyze and evaluate that pattern as if we had predicted it all along. The problem here is one of capitalizing on chance when performing multiple tests post-hoc, that is, without a priori hypotheses. To illustrate this point, let’s consider the following “experiment.” Imagine we were to write down a number between 1 and 10 on 100 pieces of paper. We then put all of those pieces into a hat and draw 20 samples (of pieces of paper) of 5 observations each, and compute the means (from the numbers written on the pieces of paper) for each group. How likely do you think it is that we will find two sample means that are significantly different from each other? It is very likely! Selecting the extreme means obtained from 20 samples is very different from taking only 2 samples from the hat in the first place, which is what the test via the contrast analysis implies. Without going into further detail, there are several so-called post-hoc tests that are explicitly based on the first scenario (taking the extremes from 20 samples), that is, they are based on the assumption that we have chosen for our comparison the most extreme (different) means out of k total means in the design. Those tests apply “corrections” that are designed to offset the advantage of post-hoc selection of the most extreme comparisons. Whenever we find unexpected results in an experiment, we should use those post-hoc procedures to test their statistical significance.
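The hat "experiment" is easy to simulate. The sketch below is an illustration under assumptions that are not in the text: the numbers on the paper slips are drawn uniformly from 1 to 10, the comparison is an uncorrected pooled two-sample t-test at the 5% level (critical value 2.306 for 8 degrees of freedom), and the experiment is repeated 2000 times. The names `naive_t` and `simulate` are hypothetical.

```python
import numpy as np

T_CRIT = 2.306  # two-sided 5% critical value of t with 5 + 5 - 2 = 8 df


def naive_t(a, b):
    """Absolute pooled two-sample t statistic for two equally sized groups."""
    se = np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / len(a))
    diff = abs(a.mean() - b.mean())
    if se == 0:
        return np.inf if diff > 0 else 0.0
    return diff / se


def simulate(n_experiments=2000, seed=0):
    rng = np.random.default_rng(seed)
    hits_extreme = hits_planned = 0
    for _ in range(n_experiments):
        hat = rng.integers(1, 11, size=100)            # 100 pieces, numbers 1..10
        groups = rng.permutation(hat).reshape(20, 5)   # 20 samples of 5
        means = groups.mean(axis=1)
        lo, hi = groups[means.argmin()], groups[means.argmax()]
        hits_extreme += int(naive_t(lo, hi) > T_CRIT)  # pair picked after looking
        hits_planned += int(naive_t(groups[0], groups[1]) > T_CRIT)  # picked in advance
    return hits_extreme / n_experiments, hits_planned / n_experiments


rate_extreme, rate_planned = simulate()
print(f"extreme pair 'significant':    {rate_extreme:.1%}")
print(f"pre-chosen pair 'significant': {rate_planned:.1%}")
```

Comparing the two printed rates shows how much more often the uncorrected test flags a pair chosen after looking at the means than a pair fixed in advance, which is the advantage that post-hoc corrections are designed to offset.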

### Testing Hypotheses for Repeated Measures and Dependent Variables

In the discussion of different hypotheses that can be tested using the general linear model, the tests have been described as tests for “the dependent variable” or “the outcome.” This has been done solely to simplify the discussion. When there are multiple dependent variables reflecting the levels of repeated measure factors, the general linear model performs tests using orthonormalized M-transformations of the dependent variables. When there are multiple dependent variables but no repeated measure factors, the general linear model performs tests using the hypothesis sums of squares and cross-products for the multiple dependent variables, which are tested against the residual sums of squares and cross-products for the multiple dependent variables. Thus, the same hypothesis testing procedures which apply to univariate designs with a single dependent variable also apply to repeated measure and multivariate designs.
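As a rough illustration of an orthonormalized M-transformation, the sketch below builds one possible matrix M for a repeated measure factor with k levels and applies it to the dependent variables. It assumes that M is obtained from a QR decomposition of a constant column plus successive-difference contrasts; this is one common construction, not necessarily the one used by any particular program, and the function name and data are hypothetical.

```python
import numpy as np


def orthonormal_contrasts(k):
    """Return a k x (k-1) matrix M with orthonormal columns that each sum to zero."""
    # First column is constant; the rest are differences between successive levels.
    A = np.column_stack([np.ones(k), np.diff(np.eye(k), axis=0).T])
    Q, _ = np.linalg.qr(A)
    return Q[:, 1:]        # drop the constant direction, keep the k-1 contrasts


# Simulated scores (illustrative only): 10 subjects measured at 3 levels.
rng = np.random.default_rng(1)
Y = rng.normal(loc=[10.0, 12.0, 15.0], scale=2.0, size=(10, 3))

M = orthonormal_contrasts(3)
T = Y @ M                  # transformed dependent variables, one column per contrast
print(np.round(M, 3))
print(T.shape)
```

Under this construction, a test of the repeated measure effect amounts to testing whether the means of the transformed variables in `T` are jointly zero.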

## Real-Time Analytics for Sentiment Analysis, Marketing, and Media Mix Optimization

### Plugging into the Instant Feedback Loop

The marketing of brands and products has dramatically changed. Fewer key messages are disseminated through printed media, radio, and TV because the response to those campaigns arrives only days, weeks, or even months later. Instead, marketing campaigns today begin with a careful consideration of which specific web portals, search providers, social media, or blog spaces to target, and how to effectively communicate the message.

### The Instant Echo Chamber

Consumers today have a voice, and they have the instant media to make their voice heard. As a consequence, any confusing marketing messages or missteps will instantly affect the blogosphere, discussion groups, and social network sites, as the “buzz” quickly emerges in the echo chambers of the world.

This means that consumer responses expressed via web media can provide immediate feedback to your marketing team:

• To provide an accurate forecast of expected sales
• To identify problem areas, unexpected barriers, or any pushback
• To refine the messages, and the mix of media echoing them, to improve marketing efficiency

Marketing > Buzz > Sales

The basic challenges are clear:

• How to determine which marketing channels to choose and how much to spend on each channel in order to reach your target audience
• How to link marketing activities to sentiment expressed by consumers on relevant web sites, blogs, discussion groups, social network sites, etc.
• How to link a reliable index of sentiment, or complex multivariate indices of consumer response and effect, to subsequent product sales
• How to put it all together to predict the expected success of an optimized marketing campaign based on the immediate feedback from consumers

### Putting It All Together: Predictive Modeling

The STATISTICA Enterprise solution for Social Media Mix Optimization provides an integrated system that is as responsive as the market and the messages reverberating through the web-based echo chambers themselves.

### Bringing Data Pieces Together

Social media response is obtainable in many formats and aggregations: from user counts, numbers of views, friends, or “Likes” that can be available daily, hourly, or even by the minute, to time-stamped customer reviews that may not be updated as frequently. Configuring and maintaining all data sources in STATISTICA Enterprise, and numericizing text fields with STATISTICA Text Miner combined with STATISTICA ETL (Extract, Transform, Load) functionality, helps to solve this challenging task in an efficient and automated way.

### STATISTICA Data Miner and Predictive Modeling

The analytic engine driving the system is the STATISTICA Data Miner library of capabilities and algorithms, which builds accurate predictive models for linking variables from different sources.

The long-established Data Miner program is the most comprehensive, best tested, and universally acknowledged most versatile platform for predictive modeling, offering options for manual model building and configuring complete workflows within a visual programming environment.

### STATISTICA Text Miner

This program provides the high-capacity engine for indexing unstructured user-generated content (text) to extract the critical dimensions defining relevant sentiments expressed across multiple web sites, blogs, and social media sites efficiently and reliably. STATISTICA Text Miner equally serves the following purposes: meaning extraction, automatic text categorization, entity extraction, bringing unstructured data to numeric form, and concept extraction with Singular Value Decomposition (SVD).
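To make "bringing unstructured data to numeric form" and "concept extraction with SVD" concrete, here is a minimal NumPy sketch. It does not use STATISTICA Text Miner; the four short comments are invented for illustration, and whitespace tokenization with raw word counts is an assumption chosen for brevity.

```python
import numpy as np

# Hypothetical mini-corpus of user comments (illustrative only).
docs = [
    "love the new phone great battery",
    "battery life is great love it",
    "terrible service slow support",
    "support was slow and service terrible",
]

# Bring the text to numeric form: a documents x terms count matrix.
vocab = sorted({word for doc in docs for word in doc.split()})
counts = np.array(
    [[doc.split().count(word) for word in vocab] for doc in docs], dtype=float
)

# Concept extraction: keep the k largest singular directions.
U, s, Vt = np.linalg.svd(counts, full_matrices=False)
k = 2
doc_concepts = U[:, :k] * s[:k]   # each document expressed as k concept scores
term_loadings = Vt[:k]            # how strongly each term loads on each concept

print(np.round(doc_concepts, 2))
```

Documents that share vocabulary end up with similar concept scores, so the rows of `doc_concepts` can serve as numeric inputs for clustering or predictive modeling.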

### STATISTICA Enterprise

This system provides the robust and scalable server backbone for automating the analytics, linking marketing expenditures to consumer sentiment, and linking consumer sentiment to expected demand (and sales). STATISTICA Enterprise also provides the display layer to manage large numbers of channels via efficient and hierarchically nested dashboards that will alert/alarm when undesirable trends are detected.

### Optimizing the Media Mix

Once a complete system is in place that reliably tracks the relationships between marketing expenditures and customer sentiment, the system can be optimized using powerful “what-if” scenario analyses to identify the optimal combinations of expenditures for different advertising and marketing channels. Predictive models will be built to establish confidence regions around the formula for the optimal mix to empower marketing or product managers to evaluate risk/reward scenarios, and ultimately, turn the buzz into sales.

### Key Features Summary

• Central Configuration and Management
• Data Connections, Aggregation, and Alignment across different departments within the organization. Data configurations are stored as metadata and serve as templates for subsequent analyses and analytic workflows
• Measure Marketing Success and Sales Conversion in one Platform
• Final Solution can Embrace Data Collection with Data Historian Functionalities or be Easily Integrated with Existing Infrastructure

## Comprehensive Analytic Modules

STATISTICA Multivariate Exploratory Techniques offers a broad selection of exploratory techniques, from cluster analysis to advanced classification tree methods, with a comprehensive array of interactive visualization tools for exploring relationships and patterns; built-in complete Visual Basic scripting.

• Cluster Analysis Techniques
• Factor Analysis and Principal Components
• Canonical Correlation Analysis
• Reliability/Item Analysis
• Classification Trees
• Correspondence Analysis
• Multidimensional Scaling
• Discriminant Analysis
• General Discriminant Analysis Models
• STATISTICA Visual Basic Language, and more.

STATISTICA Advanced Linear/Nonlinear Models contains a wide array of the most advanced linear and nonlinear modeling tools on the market, supports continuous and categorical predictors, interactions, hierarchical models; automatic model selection facilities; also, includes variance components, time series, and many other methods; all analyses include extensive, interactive graphical support and built-in complete Visual Basic scripting.

• Distribution and Simulation
• Variance Components and Mixed Model ANOVA/ANCOVA
• Survival/Failure Time Analysis
• General Nonlinear Estimation (and Logit/Probit)
• Log-Linear Analysis
• Time Series Analysis, Forecasting
• Structural Equation Modeling/Path Analysis (SEPATH)
• General Linear Models (GLM)
• General Regression Models (GRM)
• Generalized Linear/Nonlinear Models (GLZ)
• Partial Least Squares (PLS)
• STATISTICA Visual Basic Language, and more.

STATISTICA Power Analysis and Interval Estimation is an extremely precise and user-friendly research tool for analyzing all aspects of statistical power and sample size calculation.

• Power Calculations
• Sample Size Calculations
• Interval Estimation
• Probability Distribution Calculators, and more.

## STATISTICA Automated Neural Networks

STATISTICA Automated Neural Networks contains a comprehensive array of statistics, charting options, network architectures, and training algorithms; C and PMML (Predictive Model Markup Language) code generators. The C code generator is an add-on.

Fully integrated with the STATISTICA system.

• A selection of the most popular network architectures including Multilayer Perceptrons, Radial Basis Function networks, Linear Networks and Self Organizing Feature Maps.
• State-of-the-art training algorithms including: Conjugate Gradient Descent, BFGS, Kohonen training, k-Means Center Assignment
• Forming ensembles of networks for better prediction performance
• Automatic Network Search, a tool for automating neural network architecture and complexity selection
• Best Network Retention, and more.
• Supporting various statistical analyses and predictive model building including regression, classification, time series regression, time series classification and cluster analysis for dimensionality reduction and visualization.
• Fully supports deployment of multiple models

## STATISTICA Automated Neural Networks Code Generator

STATISTICA Automated Neural Networks Code Generator can generate neural network code in both C and PMML (Predictive Model Markup Language) languages. The Code Generator Add-on enables STATISTICA Automated Neural Networks users to generate a C code file to be used for compiling a C program based on the output of a neural networks analysis.

• The C code generator add-on requires STATISTICA Neural Networks
• Generates a source code version of a neural network (as a C or C++ file) which can be compiled with all C or C++ compilers.
• C code file can then be integrated into external programs.

## STATISTICA Base

STATISTICA Base offers a comprehensive set of essential statistics in a user-friendly package with flexible output management and Web enablement features; it also includes all STATISTICA graphics tools and a comprehensive Visual Basic development environment. The program is shipped on CD ROM.

• Descriptive Statistics, Breakdowns, and Exploratory Data Analysis
• Correlations
• Interactive Probability Calculator
• T-Tests (and other tests of group differences)
• Frequency Tables, Crosstabulation Tables, Stub-and-Banner Tables, Multiple Response Analysis
• Multiple Regression Methods
• Nonparametric Statistics
• Distribution Fitting
• Enhanced graphics technology
• Powerful query tools
• Flexible data management
• ANOVA [supports 4 between factors and 1 within (repeated measure) factor]
• STATISTICA Visual Basic Language, and more.

## STATISTICA Data Miner

Includes the functionality of all of the following:

STATISTICA Automated Neural Networks

STATISTICA Data Miner contains the most comprehensive selection of data mining solutions on the market, with an icon-based, extremely easy-to-use user interface. It features a selection of completely integrated, and automated, ready to deploy “as is” (but also easily customizable) specific data mining solutions for a wide variety of business applications. The product is offered optionally with deployment and on-site training services. The data mining solutions are driven by powerful procedures from five modules, which can also be used interactively and/or used to build, test, and deploy new solutions.

• General Slicer/Dicer Explorer
• General Classifier
• General Modeler/Multivariate Explorer
• General Forecaster
• General Neural Networks Explorer, and more.

Solution Packages to meet specific needs are available.

## STATISTICA Scorecard

STATISTICA Scorecard, a software solution for developing, evaluating, and monitoring scorecard models, includes the following capabilities and workflow:

• Data preparation
• Modelling
• Evaluation and calibration
• Monitoring

## STATISTICA Data Warehouse

STATISTICA Data Warehouse is the ultimate high-performance, scalable system for intelligent management of unlimited amounts of data, distributed across locations worldwide.

## STATISTICA Document Management System

STATISTICA Document Management System is a scalable solution for flexible, productivity-enhancing management of local or Web-based document repositories (FDA/ISO compliant).

## STATISTICA Enterprise

STATISTICA Enterprise is an integrated multi-user software system designed for general purpose data analysis and business intelligence applications in research, marketing, finance, and other industries. STATISTICA Enterprise provides an efficient interface to enterprise-wide data repositories and a means for collaborative work as well as all the statistical functionality available in STATISTICA Base, STATISTICA Advanced Models, and STATISTICA Exploratory Techniques (optionally also STATISTICA Automated Neural Networks and STATISTICA Power Analysis and Interval Estimation).

• An efficient general interface to enterprise-wide repositories of data
• A means for collaborative work (groupware functionality)
• A reporting tool for formatted documents (PDF, HTML, MS Word) and analysis summaries of any of the tabular and graphical results produced by STATISTICA.
• Compatible with (and linkable to) industry-standard enterprise-wide database management systems
• Custom configurations including any applications from the STATISTICA product line, and more.

## STATISTICA Enterprise / Quality Control (QC)

STATISTICA’s comprehensive array of both routine and high-end statistical analyses, superior graphing technology, and unparalleled record of reviews give STATISTICA Enterprise/QC many advantages over competing products. A unique combination of features not found in any other SPC system makes STATISTICA Enterprise/QC the most comprehensive SPC system available.

• Real-time analytical tools
• A high performance database
• Groupware functionality for sharing queries, special applications, etc.
• A sophisticated reporting tool for web-based output
• Built-in security system
• User-specific interfaces
• Open-ended alarm notification including cause/action prompts
• Interactive querying facilities
• Integration with external applications (Word, Excel, browsers)
• and much, much more…

## STATISTICA Enterprise Web Viewer

STATISTICA Enterprise Web Viewer provides the ability to view analyses and reports that were generated within STATISTICA Enterprise or STATISTICA Enterprise / QC. This allows companies to protect their data and reports with the STATISTICA Enterprise security model.

## STATISTICA Extract, Transform, and Load (ETL)

STATISTICA Extract, Transform, and Load (ETL) provides options to simplify and facilitate access to, aggregation, and alignment of data from multiple databases, when some of the databases contain process data (using the optional PI Connector), while others contain “static” data (e.g., from Oracle or MS SQL Server). Provides for ad-hoc querying and aligning of data, for subsequent analyses such as ad-hoc charting etc. of data describing a specific time interval.

## STATISTICA Live Score

STATISTICA Live Score is STATISTICA Server software within the STATISTICA Data Analysis and Data Mining Platform. Data are aggregated and cleaned, and models are trained and validated, using the STATISTICA Data Miner software. Once the models are validated, they are deployed to the STATISTICA Live Score server. STATISTICA Live Score provides multi-threaded, efficient, and platform-independent scoring of data from line-of-business applications.

## STATISTICA Monitoring and Alerting Server (MAS)

STATISTICA Monitoring and Alerting Server (MAS) is a system that enables users to automate the continual monitoring of hundreds or thousands of critical process and product parameters.

## STATISTICA MultiStream™ for Pharmaceutical Industries

STATISTICA MultiStream for Pharmaceutical Industries is a solution package for identifying and implementing effective strategies for advanced multivariate process monitoring and control. STATISTICA MultiStream was designed for process industries in general, but is particularly well suited to help pharmaceutical manufacturers leverage the data collected in their existing specialized process databases for multivariate and predictive process control.

## STATISTICA MultiStream™ for Power Industries

STATISTICA MultiStream for Power Industries is a solution package for identifying and implementing effective strategies for advanced multivariate process monitoring and control. STATISTICA MultiStream was designed for process industries in general, but is particularly well suited to help power generation facilities leverage the data collected in their existing specialized process databases for multivariate and predictive process control and for actionable advisory systems.

## STATISTICA Multivariate Statistical Process Control (MSPC)

STATISTICA Multivariate Statistical Process Control (MSPC) is a complete solution for multivariate statistical process control, deployed within a scalable, secure analytics software platform.

## STATISTICA PI Connector

STATISTICA PI Connector is an optional STATISTICA add-on component that allows for direct integration with data stored in the PI data historian. The STATISTICA PI Connector utilizes the PI user access control and security model, allows for interactive browsing of tags, and takes advantage of dedicated PI functionality for interpolation and snapshot data. STATISTICA integrated with the PI system is being used for streamlined and automated analyses for applications such as Process Analytical Technology (PAT) in FDA-regulated industries, Advanced Process Control (APC) systems in Chemical and Petrochemical industries, and advisory systems for process optimization and compliance in the Energy Utility industry.

## STATISTICA Process Optimization

STATISTICA Process Optimization, an optional extension of STATISTICA Data Miner, is a powerful software solution designed to monitor processes and identify and anticipate problems related to quality control and improvement with unmatched sensitivity and effectiveness. STATISTICA Process Optimization integrates all quality control charts, process capability analyses, experimental design procedures, and Six Sigma methods with a comprehensive library of cutting-edge techniques for exploratory and predictive data mining.

STATISTICA Process Optimization enables its users to:

• Predict QC problems with cutting edge data mining methods
• Discover root causes of problem areas
• Monitor and improve ROI (Return On Investment)
• Generate suggestions for improvement
• Monitor processes in real time over the Web
• Create and deploy QC/SPC solutions over the Web
• Use multithreading and distributed processing to rapidly process extremely large streams of data.
• General Optimization

Solution Packages to meet specific needs are available.

## STATISTICA Quality Control (QC)

Includes the functionality of all of the following:

STATISTICA Base

STATISTICA Quality Control Charts offers versatile presentation-quality charts with a selection of automation options, customizable features, and user-interface shortcuts to simplify routine work.

• Quality Control Charts
• Interactive Quality Control Charts including: real-time updating of charts, automatic alarm notification, shop floor mode, assigning causes and actions, analytic brushing, and dynamic project management
• Multivariate Quality Control Charts including: Hotelling T-Square Charts, Multiple Stream (Group) Charts, Multivariate Exponentially Weighted Moving Average (MEWMA) Charts, Multivariate Cumulative Sum (MCUSUM) Charts, Generalized Variance Charts
• STATISTICA Visual Basic Language, and more.

STATISTICA Process Analysis is a comprehensive package for process capability, Gage R&R, and other quality control/improvement applications.

• Process Capability Analysis
• Weibull Analysis
• Gage Repeatability & Reproducibility
• Sampling Plans
• Variance Components, and more.

STATISTICA Design of Experiments features the largest selection of DOE, visualization and other analytic techniques including powerful desirability profilers and extensive residual statistics.

• Fractional Factorial Designs
• Mixture Designs
• Latin Squares
• Search for Optimal 2**k-p Designs
• Residual Analysis and Transformations
• Optimization of Single or Multiple Response Variables
• Central Composite Designs
• Taguchi Designs
• Desirability Profiler
• Minimum Aberration and Maximum Unconfounding 2**k-p Fractional Factorial Designs with Blocks
• Constrained Surfaces
• D- and A-optimal Designs, and more.

STATISTICA Power Analysis and Interval Estimation is an extremely precise and user-friendly research tool for analyzing all aspects of statistical power and sample size calculation.

• Power Calculations
• Sample Size Calculations
• Interval Estimation
• Probability Distribution Calculators, and more.

## STATISTICA Sequence, Association, and Link Analysis (SAL)

STATISTICA Sequence, Association and Link Analysis (SAL) is designed to address the needs of clients in industries such as retailing, banking, and insurance by implementing the fastest known highly scalable algorithm with the ability to derive Association and Sequence rules in one single analysis. The program represents a stand-alone module that can be used for both model building and deployment. All tools in STATISTICA Data Miner can be quickly and effortlessly leveraged to analyze and “drill into” results generated via STATISTICA SAL.

• Uses a Tree-Building technique to extract Association and Sequence rules from data
• Uses efficient and thread-safe local relational Database technology to store Association and Sequence models
• Handles multiple response, multiple dichotomy and continuous variables in one analysis
• Performs Sequence analysis while mining for Association rules in a single analysis
• Simultaneously extracts Association and Sequence rules for more than one dimension
• Given the ability to perform multidimensional Association and Sequence mining and the capacity to extract only rules for specific items, the program can be used for Predictive Data Mining
• Performs Hierarchical Single-Linkage Cluster analysis which can detect the more likely cluster of items that can occur. This has extremely useful, practical real-world applications such as in Retailing.

## STATISTICA Text Miner

STATISTICA Text Miner is an optional extension of STATISTICA Data Miner. The program features a large selection of text retrieval, pre-processing, and analytic and interpretive mining procedures for unstructured text data (including Web pages), with numerous options for converting text into numeric information (for mapping, clustering, predictive data mining, etc.) and language-specific stemming algorithms. Because of STATISTICA’s flexible data import options, the methods available in STATISTICA Text Miner can also be useful for processing other unstructured input (e.g., image files imported as data matrices, etc.).

## STATISTICA Web Based Data Entry

STATISTICA Web Data Entry enables companies to configure data entry scenarios to allow data entry via Web browsers and the analysis of these data using all of the graphical data analysis, statistical analysis, and data mining capabilities of the STATISTICA Enterprise software platform.

STATISTICA Web Data Entry builds on the configuration objects in STATISTICA Enterprise:

• Characteristics: Numeric data to be collected for analysis (e.g., pH)
• Labels: Text or date data for traceability (e.g., Lot Number)
• Data Entry Setups: Groups of Characteristics and Labels configured with specific User/Group permissions to collect the appropriate data for particular scenarios

## STATISTICA Variance Estimation and Precision

STATISTICA Variance Estimation and Precision is a comprehensive set of techniques for analyzing data from experiments that include both fixed and random effects using REML (Restricted Maximum Likelihood Estimation). With Variance Estimation and Precision, users can obtain estimates of variance components and use them to make precision statements while at the same time comparing fixed effects in the presence of multiple sources of variation.

Variance Estimation and Precision includes the following:

• Variability plots
• Multiple plot layouts to allow direct comparison of multiple dependent variables
• Expected mean squares and variance components with confidence intervals
• Flexible handling of multiple dependent variables: analyze several variables with the same or different designs at once
• Graph displays of variance components

## WebSTATISTICA Knowledge Portal

WebSTATISTICA Knowledge Portal is the ultimate knowledge-sharing tool. It incorporates the latest Internet technology and includes a powerful, flexible report generation tool and a secure system for information delivery.

## WebSTATISTICA Server Applications

WebSTATISTICA Server Applications is the ultimate enterprise system that offers full Web enablement, including the ability to run STATISTICA interactively or in batch from a Web browser on any computer (incl. Linux, UNIX), offload time consuming tasks to the servers (using distributed processing), use multi-tier Client-Server architecture, manage projects over the Web, and collaborate “across the hall or across continents.”

• Work collaboratively “across the hall” or “across continents”
• Run STATISTICA using any computer in the world (connected to the Internet