Monthly Archives: August 2012

STATISTICA Process Optimization

STATISTICA Process Optimization is a specialized add-on to STATISTICA Data Minerand it offers a powerful software solution designed to analyze, monitor, and optimize complex processes based on all available data. The products combine the most powerful tools for monitoring quality and other characteristics of a process with cutting-edge data mining technology and process optimization capabilities.


STATISTICA Process Optimization is applicable to any situation where (often a large amount of) data are available describing a complex organizational, service-delivery, manufacturing, or continuous or batch manufacturing environment.

  • Typical Areas of Application
  • A Complete Solution
  • Integration with the STATISTICA System
  • Go Beyond Simple Process Monitoring and Single-Factor Optimization




Typical Areas of Application

Using years’ worth of past data collected for a manufacturing process, STATISTICA Process Optimization can find trends that reoccur over time. These trends are then used to predict means, minimums, maximums, and ranges for samples not yet created. Understanding the sample trends and forecasting future samples proves invaluable to the manufacturing process.

In a process with hundreds or thousands of process inputs, all potentially affecting the final product, STATISTICA Process Optimization will determine a subset of those predictors with the most influence. This allows us to focus on a hand full of important parameters of the complex process while at the same time, giving greater ability to influence the final product.

A Complete Solution

The tools available in STATISTICA Process Optimization, and the supporting services available through StatSoft’s specialized consulting practice world-wide, will allow StatSoft clients to apply the cutting edge technologies required in today’s challenging business environment, in order to boost competitiveness and achieve superior return on investment for sustained success. This specialized add-on to STATISTICA Data Minerand other tools is offered either (a) as a complete “turn-key” (deployed) solution, custom tailored by StatSoft consultants and engineers to your specific needs, and/or (b) as a comprehensive set of tools that enables you to easily build new or customize existing solutions.

STATISTICA Process Optimization provides these advanced features of the STATISTICA system:

  1. Methods and algorithms for modeling complex processes (e.g., for building predictive models of quality, process outcomes, key performance indicators; model-based process monitoring)
  2. Advanced methods for root cause analysis (e.g., to identify important process parameters from among thousands of parameters available for process monitoring)
  3. Optimization (e.g., optimize arbitrary cost functions based on one or more models for key performance/process outcomes)
  4. Simulation (e.g., simulate non-normal multivariate processes and models to identify expected performance, reliability, etc.)

STATISTICA Process Optimization integrates:

  1. The most complete comprehensive selection of data mining methods offered in STATISTICA Data Miner, tools for root cause analysis, STATISTICA software for quality control and improvement, STATISTICA Multivariate Quality Control charting, and software for uncovering trends/drift, explaining known patterns, and predicting/forecasting.
  2. Specialized, unique tools for multivariate process simulation.
  3. Specialized, unique tools for process optimization of arbitrary goal functions (and data mining models for one or more outcomes and/or cost functions).
  4. The complete set of superior analytic graphics and exploratory methods for drill down and problem “understanding”.

Specifically, STATISTICA Process Optimization provides capabilities for:

Statistical Process Control (SPC), Multivariate SPC, Advanced Process Monitoring

  • All quality control charts, multivariate quality control charts, process capability analyses, experimental design procedures, or Six Sigma methods and charts are integrated with a comprehensive library of cutting-edge techniques for exploratory and predictive data mining.
  • Capabilities for connecting directly to external databases, to monitor live data streams
  • Methods for standard control charting, trending, as well as multivariate charting and trending (e.g., MCUSUM, MEWMA, T?)

Data Mining

  • All tools and methods for data exploration and review, to identify excursions, patterns, etc.
  • All tools and methods for data preparation (e.g., recoding of outliers and excursions, identification and imputation of missing data, etc.)
  • All methods, algorithms, and techniques available in STATISTICA Data Miner, which contains the most comprehensive collection of data mining algorithms and methods in a single package (e.g., including Automated Neural Networks, various Tree/Recursive-Partitioning algorithms, Boosted Trees and Forests, Support Vector Machine, Multivariate Adaptive Regression Splines, etc.), providing the most powerful toolset available for modeling/predicting complex process outcomes


  • Optimization of multiple data mining prediction models (e.g., application of Simplex optimization, Genetic Algorithm optimization, etc., to minimize/maximize the prediction from one or more predictive data mining models)
  • Generic optimization of user-defined loss/cost functions (e.g., application of Simplex optimization, Genetic Algorithm optimization, Grid search methods to optimize arbitrary goal functions defined by the user via the convenient STATISTICA Visual Basic (SVB) scripting language; thus; multiple models for multiple outcomes (complex cost functions) can be optimized simultaneously; note that STATISTICA can be integrated with the WebSTATISTICA environment, so that optimization can be offloaded to powerful 64-bit multi-processor servers to tackle problems of significant complexity)


  • Distribution fitting to multivariate datasets; automatic selection of normal, non-normal, or mixtures of distributions
  • Estimation of variable covariance matrices
  • Advanced simulation of multivariate non-normal distributions and covariance structures, to identify expected distributions of reliability, yield, quality, success/response-rate to marketing campaign strategies, investment and portfolio strategies, etc.

Integration with the STATISTICA System

STATISTICA Process Optimization is part of the STATISTICA family of software products, and is fully integrated with all tools for enterprise deployment.

  • For example, data mining prediction models can be automatically connected to (Web)STATISTICA Enterprise for model-based predictive quality control, virtual sensors, etc. (see also references and white papers listed below) deployed against live data streams.
  • Data Mining projects and modeling can be off-loaded to more powerful servers running WebSTATISTICA; for example, optimization can be performed simultaneously over multiple server processors.
  • Model based process monitoring models (e.g., for control of batch maturation processes, continuous processes such as power generation, chemical manufacturing, etc.) can be deployed to STATISTICA Monitoring and Alerting Server, to enable enterprise-wide comprehensive multivariate process monitoring.

Go Beyond Simple Process Monitoring and Single-Factor Optimization

StatSoft has been at the forefront of providing practical and effective solutions for advanced process monitoring and optimization worldwide for nearly two decades!

Hill, T., Eames, R., Lahoti, S. (2008). Finding direction in chaos: Data mining methods make sense out of millions of seemingly random data points. Quality Digest, December, 20-23.

Hill, T., EPRI/StatSoft Project 44771: Statistical Use of Existing DCS Data for Process Optimization, EPRI, Palo Alto, CA, 2008). (Note: This paper is available to EPRI members. It can also be purchased. Search for paper title on

Grichnik, T., Hill, T., & Seskin, M. (2006). Predicting quality outcomes through data mining. Quality Digest, September, 42-47.

Lewicki, P; Hill, T; Qazaz, C. (2007). Multivariate Quality Control. Quality Magazine, April, 38-45.


System Requirements


STATISTICA Process Optimization is compatible with Windows XP, Windows Vista, and Windows 7.

Minimum System Requirements

  • Operating System: Windows XP or above
  • RAM: 1 GB
  • Processor Speed: 2.0 GHz

Recommended System Requirements

  • Operating System: Windows XP or above
  • RAM: 4 GB or more
  • Processor Speed: 2.0 GHz, 64-bit, dual core 

Native 64-bit versions and highly optimized multiprocessor versions are available.

What is Quality Control and Quality Control Charts?

In all production processes, we need to monitor the extent to which our products meet specifications. In the most general terms, there are two “enemies” of product quality:

  • deviations from target specifications
  • excessive variability around target specifications

During the earlier stages of developing the production process, designed experiments are often used to optimize these two quality characteristics (see Experimental Design); the methods provided in Quality Control are on-line or in-process quality control procedures to monitor an on-going production process. For detailed descriptions of these charts and extensive annotated examples, see Buffa (1972), Duncan (1974) Grant and Leavenworth (1980), Juran (1962), Juran and Gryna (1970), Montgomery (1985, 1991), Shirland (1993), or Vaughn (1974). Two excellent introductory texts with a “how-to” approach are Hart & Hart (1989) and Pyzdek (1989); two German language texts on this subject are Rinne and Mittag (1995) and Mittag (1993).


General Approach

The general approach to on-line quality control is straightforward: We simply extract samples of a certain size from the ongoing production process. We then produce line charts of the variability in those samples and consider their closeness to target specifications. If a trend emerges in those lines, or if samples fall outside pre-specified limits, we declare the process to be out of control and take action to find the cause of the problem. These types of charts are sometimes also referred to as Shewhart control charts (named after W. A. Shewhart, who is generally credited as being the first to introduce these methods; see Shewhart, 1931).

Interpreting the chart. The most standard display actually contains two charts (and two histograms); one is called an X-bar chart, the other is called an R chart.

In both line charts, the horizontal axis represents the different samples; the vertical axis for the X-bar chart represents the means for the characteristic of interest; the vertical axis for the R chart represents the ranges. For example, suppose we want to control the diameter of piston rings that we are producing. The center line in the X-bar chart would represent the desired standard size (e.g., diameter in millimeters) of the rings, while the center line in the R chart would represent the acceptable (within-specification) range of the rings within samples; thus, this latter chart is a chart of the variability of the process (the larger the variability, the larger the range). In addition to the center line, a typical chart includes two additional horizontal lines to represent the upper and lower control limits (UCL, LCL, respectively); we will return to those lines shortly. Typically, the individual points in the chart, representing the samples, are connected by a line. If this line moves outside the upper or lower control limits or exhibits systematic patterns across consecutive samples (see Runs Tests), a quality problem may potentially exist.

Elementary Concepts discusses the concept of the sampling distribution and the characteristics of the normal distribution. The method for constructing the upper and lower control limits is a straightforward application of the principles described there.

Establishing Control Limits

Even though we could arbitrarily determine when to declare a process out of control (that is, outside the UCL-LCL range), it is common practice to apply statistical principles to do so.

Example. Suppose we want to control the mean of a variable, such as the size of piston rings. Under the assumption that the mean (and variance) of the process does not change, the successive sample means will be distributed normally around the actual mean. Moreover, without going into details regarding the derivation of this formula, we also know (because of the central limit theorem, and thus approximate normal distribution of the means; see, for example, Hoyer and Ellis, 1996) that the distribution of sample means will have a standard deviation of Sigma (the standard deviation of individual data points or measurements) over the square root of n (the sample size). It follows that approximately 95% of the sample means will fall within the limits ± 1.96 * Sigma/Square Root(n) (refer to Elementary Concepts for a discussion of the characteristics of the normal distribution and the central limit theorem). In practice, it is common to replace the 1.96 with 3 (so that the interval will include approximately 99% of the sample means) and to define the upper and lower control limits as plus and minus 3 sigma limits, respectively.

General case. The general principle for establishing control limits just described applies to all control charts. After deciding on the characteristic we want to control, for example, the standard deviation, we estimate the expected variability of the respective characteristic in samples of the size we are about to take. Those estimates are then used to establish the control limits on the chart.

Common Types of Charts

The types of charts are often classified according to the type of quality characteristic that they are supposed to monitor: there are quality control charts for variables and control charts for attributes. Specifically, the following charts are commonly constructed for controlling variables:

  • X-bar chart. In this chart, the sample means are plotted in order to control the mean value of a variable (e.g., size of piston rings, strength of materials, etc.).
  • R chart. In this chart, the sample ranges are plotted in order to control the variability of a variable.
  • S chart. In this chart, the sample standard deviations are plotted in order to control the variability of a variable.
  • S**2 chart. In this chart, the sample variances are plotted in order to control the variability of a variable.

For controlling quality characteristics that represent attributes of the product, the following charts are commonly constructed:

  • C chart. In this chart (see example below), we plot the number of defectives (per batch, per day, per machine, per 100 feet of pipe, etc.). This chart assumes that defects of the quality attribute are rare, and the control limits in this chart are computed based on the Poisson distribution(distribution of rare events).
  • U chart. In this chart we plot the rate of defectives, that is, the number of defectives divided by the number of units inspected (the n; e.g., feet of pipe, number of batches). Unlike the C chart, this chart does not require a constant number of units, and it can be used, for example, when the batches (samples) are of different sizes.
  • Np chart. In this chart, we plot the number of defectives (per batch, per day, per machine) as in the C chart. However, the control limits in this chart are not based on the distribution of rare events, but rather on the binomial distribution. Therefore, this chart should be used if the occurrence of defectives is not rare (e.g., they occur in more than 5% of the units inspected). For example, we may use this chart to control the number of units produced with minor flaws.
  • P chart. In this chart, we plot the percent of defectives (per batch, per day, per machine, etc.) as in the U chart. However, the control limits in this chart are not based on the distribution of rare events but rather on the binomial distribution (of proportions). Therefore, this chart is most applicable to situations where the occurrence of defectives is not rare (e.g., we expect the percent of defectives to be more than 5% of the total number of units produced).

All of these charts can be adapted for short production runs (short run charts), and for multiple process streams.

Short Run Control Charts

The short run control chart, or control chart for short production runs, plots observations of variables or attributes for multiple parts on the same chart. Short run control charts were developed to address the requirement that several dozen measurements of a process must be collected before control limits are calculated. Meeting this requirement is often difficult for operations that produce a limited number of a particular part during a production run.

For example, a paper mill may produce only three or four (huge) rolls of a particular kind of paper (i.e., part) and then shift production to another kind of paper. But if variables, such as paper thickness, or attributes, such as blemishes, are monitored for several dozen rolls of paper of, say, a dozen different kinds, control limits for thickness and blemishes could be calculated for the transformed (within the short production run) variable values of interest. Specifically, these transformations will rescale the variable values of interest such that they are of compatible magnitudes across the different short production runs (or parts). The control limits computed for those transformed values could then be applied in monitoring thickness, and blemishes, regardless of the types of paper (parts) being produced. Statistical process control procedures could be used to determine if the production process is in control, to monitor continuing production, and to establish procedures for continuous quality improvement.

For additional discussions of short run charts refer to Bhote (1988), Johnson (1987), or Montgomery (1991).

Short Run Charts for Variables

Nominal chart, target chart. There are several different types of short run charts. The most basic are the nominal short run chart, and the target short run chart. In these charts, the measurements for each part are transformed by subtracting a part-specific constant. These constants can either be the nominal values for the respective parts (nominal short run chart), or they can be target values computed from the (historical) means for each part (Target X-bar and R chart). For example, the diameters of piston bores for different engine blocks produced in a factory can only be meaningfully compared (for determining the consistency of bore sizes) if the mean differences between bore diameters for different sized engines are first removed. The nominal or target short run chart makes such comparisons possible. Note that for the nominal or target chart it is assumed that the variability across parts is identical, so that control limits based on a common estimate of the process sigma are applicable.

Standardized short run chart. If the variability of the process for different parts cannot be assumed to be identical, then a further transformation is necessary before the sample means for different parts can be plotted in the same chart. Specifically, in the standardized short run chart the plot points are further transformed by dividing the deviations of sample means from part means (or nominal or target values for parts) by part-specific constants that are proportional to the variability for the respective parts. For example, for the short run X-bar and R chart, the plot points (that are shown in the X-bar chart) are computed by first subtracting from each sample mean a part specific constant (e.g., the respective part mean, or nominal value for the respective part), and then dividing the difference by another constant, for example, by the average range for the respective chart. These transformations will result in comparable scales for the sample means for different parts.

Short Run Charts for Attributes

For attribute control charts (C, U, Np, or P charts), the estimate of the variability of the process (proportion, rate, etc.) is a function of the process average (average proportion, rate, etc.; for example, the standard deviation of a proportion p is equal to the square root of p*(1- p)/n). Hence, only standardized short run charts are available for attributes. For example, in the short run P chart, the plot points are computed by first subtracting from the respective sample p values the average part p‘s, and then dividing by the standard deviation of the average p‘s.

Unequal Sample Sizes

When the samples plotted in the control chart are not of equal size, then the control limits around the center line (target specification) cannot be represented by a straight line. For example, to return to the formula Sigma/Square Root(n) presented earlier for computing control limits for the X-bar chart, it is obvious that unequal n‘s will lead to different control limits for different sample sizes. There are three ways of dealing with this situation.

Average sample size. If you want to maintain the straight-line control limits (e.g., to make the chart easier to read and easier to use in presentations), then you can compute the average n per sample across all samples, and establish the control limits based on the average sample size. This procedure is not “exact,” however, as long as the sample sizes are reasonably similar to each other, this procedure is quite adequate.

Variable control limits. Alternatively, you may compute different control limits for each sample, based on the respective sample sizes. This procedure will lead to variable control limits, and result in step-chart like control lines in the plot. This procedure ensures that the correct control limits are computed for each sample. However, you lose the simplicity of straight-line control limits.

Stabilized (normalized) chart. The best of two worlds (straight line control limits that are accurate) can be accomplished by standardizing the quantity to be controlled (mean, proportion, etc.) according to units of sigma. The control limits can then be expressed in straight lines, while the location of the sample points in the plot depend not only on the characteristic to be controlled, but also on the respective sample n‘s. The disadvantage of this procedure is that the values on the vertical (Y) axis in the control chart are in terms of sigma rather than the original units of measurement, and therefore, those numbers cannot be taken at face value (e.g., a sample with a value of 3 is 3 times sigma away from specifications; in order to express the value of this sample in terms of the original units of measurement, we need to perform some computations to convert this number back).

Control Charts for Variables vs. Charts for Attributes

Sometimes, the quality control engineer has a choice between variable control charts and attribute control charts.

Advantages of attribute control charts. Attribute control charts have the advantage of allowing for quick summaries of various aspects of the quality of a product, that is, the engineer may simply classify products as acceptable or unacceptable, based on various quality criteria. Thus, attribute charts sometimes bypass the need for expensive, precise devices and time-consuming measurement procedures. Also, this type of chart tends to be more easily understood by managers unfamiliar with quality control procedures; therefore, it may provide more persuasive (to management) evidence of quality problems.

Advantages of variable control charts. Variable control charts are more sensitive than attribute control charts (see Montgomery, 1985, p. 203). Therefore, variable control charts may alert us to quality problems before any actual “unacceptables” (as detected by the attribute chart) will occur. Montgomery (1985) calls the variable control charts leading indicators of trouble that will sound an alarm before the number of rejects (scrap) increases in the production process.

Control Chart for Individual Observations

Variable control charts can by constructed for individual observations taken from the production line, rather than samples of observations. This is sometimes necessary when testing samples of multiple observations would be too expensive, inconvenient, or impossible. For example, the number of customer complaints or product returns may only be available on a monthly basis; yet, you want to chart those numbers to detect quality problems. Another common application of these charts occurs in cases when automated testing devices inspect every single unit that is produced. In that case, you are often primarily interested in detecting small shifts in the product quality (for example, gradual deterioration of quality due to machine wear). The CUSUM, MA, and EWMA charts of cumulative sums and weighted averages discussed below may be most applicable in those situations.

Out-Of-Control Process: Runs Tests

As mentioned earlier in the introduction, when a sample point (e.g., mean in an X-bar chart) falls outside the control lines, you have reason to believe that the process may no longer be in control. In addition, you should look for systematic patterns of points (e.g., means) across samples, because such patterns may indicate that the process average has shifted. These tests are also sometimes referred to as AT&T runs rules (see AT&T, 1959) or tests for special causes (e.g., see Nelson, 1984, 1985; Grant and Leavenworth, 1980; Shirland, 1993). The term special or assignable causes as opposed to chance or common causes was used by Shewhart to distinguish between a process that is in control, with variation due to random (chance) causes only, from a process that is out of control, with variation that is due to some non-chance or special (assignable) factors (cf. Montgomery, 1991, p. 102).

As the sigma control limits discussed earlier, the runs rules are based on “statistical” reasoning. For example, the probability of any sample mean in an X-bar control chart falling above the center line is equal to 0.5, provided (1) that the process is in control (i.e., that the center line value is equal to the population mean), (2) that consecutive sample means are independent (i.e., not auto-correlated), and (3) that the distribution of means follows the normal distribution. Simply stated, under those conditions there is a 50-50 chance that a mean will fall above or below the center line. Thus, the probability that two consecutive means will fall above the center line is equal to 0.5 times 0.5 = 0.25.

Accordingly, the probability that 9 consecutive samples (or a run of 9 samples) will fall on the same side of the center line is equal to 0.5**9 = .00195. Note that this is approximately the probability with which a sample mean can be expected to fall outside the 3- times sigma limits (given the normal distribution, and a process in control). Therefore, you could look for 9 consecutive sample means on the same side of the center line as another indication of an out-of-control condition. Refer to Duncan (1974) for details concerning the “statistical” interpretation of the other (more complex) tests.

Zone A, B, C. Customarily, to define the runs tests, the area above and below the chart center line is divided into three “zones.”

By default, Zone A is defined as the area between 2 and 3 times sigma above and below the center line; Zone B is defined as the area between 1 and 2 times sigma, and Zone C is defined as the area between the center line and 1 times sigma.

9 points in Zone C or beyond (on one side of central line). If this test is positive (i.e., if this pattern is detected), then the process average has probably changed. Note that it is assumed that the distribution of the respective quality characteristic in the plot is symmetrical around the mean. This is, for example, not the case for R charts, S charts, or most attribute charts. However, this is still a useful test to alert the quality control engineer to potential shifts in the process. For example, successive samples with less-than-average variability may be worth investigating, since they may provide hints on how to decrease the variation in the process.

6 points in a row steadily increasing or decreasing. This test signals a drift in the process average. Often, such drift can be the result of tool wear, deteriorating maintenance, improvement in skill, etc. (Nelson, 1985).

14 points in a row alternating up and down. If this test is positive, it indicates that two systematically alternating causes are producing different results. For example, you may be using two alternating suppliers, or monitor the quality for two different (alternating) shifts.

2 out of 3 points in a row in Zone A or beyond. This test provides an “early warning” of a process shift. Note that the probability of a false-positive (test is positive but process is in control) for this test in X-bar charts is approximately 2%.

4 out of 5 points in a row in Zone B or beyond. Like the previous test, this test may be considered to be an “early warning indicator” of a potential process shift. The false- positive error rate for this test is also about 2%.

15 points in a row in Zone C (above and below the center line). This test indicates a smaller variability than is expected (based on the current control limits).

8 points in a row in Zone B, A, or beyond, on either side of the center line (without points in Zone C). This test indicates that different samples are affected by different factors, resulting in a bimodal distribution of means. This may happen, for example, if different samples in an X-bar chart where produced by one of two different machines, where one produces above average parts, and the other below average parts.

Operating Characteristic (OC) Curves

A common supplementary plot to standard quality control charts is the so-called operating characteristic or OC curve (see example below). One question that comes to mind when using standard variable or attribute charts is how sensitive is the current quality control procedure? Put in more specific terms, how likely is it that you will not find a sample (e.g., mean in an X-bar chart) outside the control limits (i.e., accept the production process as “in control”), when, in fact, it has shifted by a certain amount? This probability is usually referred to as the (beta) error probability, that is, the probability of erroneously accepting a process (mean, mean proportion, mean rate defectives, etc.) as being “in control.” Note that operating characteristic curves pertain to the false-acceptance probability using the sample-outside-of- control-limits criterion only, and not the runs tests described earlier.

Operating characteristic curves are extremely useful for exploring the power of our quality control procedure. The actual decision concerning sample sizes should depend not only on the cost of implementing the plan (e.g., cost per item sampled), but also on the costs resulting from not detecting quality problems. The OC curve allows the engineer to estimate the probabilities of not detecting shifts of certain sizes in the production quality.

Process Capability Indices

For variable control charts, it is often desired to include so-called process capability indices in the summary graph. In short, process capability indices express (as a ratio) the proportion of parts or items produced by the current process that fall within user-specified limits (e.g., engineering tolerances).

For example, the so-called Cp index is computed as:

Cp = (USL-LSL)/(6*sigma)

where sigma is the estimated process standard deviation, and USL and LSL are the upper and lower specification (engineering) limits, respectively. If the distribution of the respective quality characteristic or variable (e.g., size of piston rings) is normal, and the process is perfectly centered (i.e., the mean is equal to the design center), then this index can be interpreted as the proportion of the range of the standard normal curve (the process width) that falls within the engineering specification limits. If the process is not centered, an adjusted index Cpk is used instead. For a “capable” process, the Cp index should be greater than 1, that is, the specification limits would be larger than 6 times the sigma limits, so that over 99% of all items or parts produced could be expected to fall inside the acceptable engineering specifications. For a detailed discussion of this and other indices, refer to Process Analysis

Other Specialized Control Charts

The types of control charts mentioned so far are the “workhorses” of quality control, and they are probably the most widely used methods. However, with the advent of inexpensive desktop computing, procedures requiring more computational effort have become increasingly popular.

X-bar Charts For Non-Normal Data. The control limits for standard X-bar charts are constructed based on the assumption that the sample means are approximately normally distributed. Thus, the underlying individual observations do not have to be normally distributed, since, as the sample size increases, the distribution of the means will become approximately normal (i.e., see discussion of the central limit theorem in the Elementary Concepts; however, note that for R, S¸ and S**2 charts, it is assumed that the individual observations are normally distributed). Shewhart (1931) in his original work experimented with various non-normal distributions for individual observations, and evaluated the resulting distributions of means for samples of size four. He concluded that, indeed, the standard normal distribution-based control limits for the means are appropriate, as long as the underlying distribution of observations are approximately normal. (See also Hoyer and Ellis, 1996, for an introduction and discussion of the distributional assumptions for quality control charting.) .

However, as Ryan (1989) points out, when the distribution of observations is highly skewed and the sample sizes are small, then the resulting standard control limits may produce a large number of false alarms (increased alpha error rate), as well as a larger number of false negative (“process-is-in-control”) readings (increased beta-error rate). You can compute control limits (as well as process capability indices) for X-bar charts based on so-called Johnson curves(Johnson, 1949), which allow to approximate the skewness and kurtosis for a large range of non-normal distributions (see also Fitting Distributions by Moments, in Process Analysis). These non- normal X-bar charts are useful when the distribution of means across the samples is clearly skewed, or otherwise non-normal.

Hotelling T**2 Chart. When there are multiple related quality characteristics (recorded in several variables), we can produce a simultaneous plot (see example below) for all means based on Hotelling multivariate T**2 statistic (first proposed by Hotelling, 1947).

Cumulative Sum (CUSUM) Chart. The CUSUM chart was first introduced by Page (1954); the mathematical principles involved in its construction are discussed in Ewan (1963), Johnson (1961), and Johnson and Leone (1962).

If you plot the cumulative sum of deviations of successive sample means from a target specification, even minor, permanent shifts in the process mean will eventually lead to a sizable cumulative sum of deviations. Thus, this chart is particularly well-suited for detecting such small permanent shifts that may go undetected when using the X-bar chart. For example, if, due to machine wear, a process slowly “slides” out of control to produce results above target specifications, this plot would show a steadily increasing (or decreasing) cumulative sum of deviations from specification.

To establish control limits in such plots, Barnhard (1959) proposed the so-called V- mask, which is plotted after the last sample (on the right). The V-mask can be thought of as the upper and lower control limits for the cumulative sums. However, rather than being parallel to the center line; these lines converge at a particular angle to the right, producing the appearance of a V rotated on its side. If the line representing the cumulative sum crosses either one of the two lines, the process is out of control.

Moving Average (MA) Chart. To return to the piston ring example, suppose we are mostly interested in detecting small trends across successive sample means. For example, we may be particularly concerned about machine wear, leading to a slow but constant deterioration of quality (i.e., deviation from specification). The CUSUM chart described above is one way to monitor such trends, and to detect small permanent shifts in the process average. Another way is to use some weighting scheme that summarizes the means of several successive samples; moving such a weighted mean across the samples will produce a moving average chart (as shown in the following graph).

Exponentially-weighted Moving Average (EWMA) Chart. The idea of moving averages of successive (adjacent) samples can be generalized. In principle, in order to detect a trend we need to weight successive samples to form a moving average; however, instead of a simple arithmetic moving average, we could compute a geometric moving average (this chart (see graph below) is also called Geometric Moving Average chart, see Montgomery, 1985, 1991).

Specifically, we could compute each data point for the plot as:

zt = *x-bart + (1-)*zt-1

In this formula, each point zt is computed as (lambda) times the respective mean x-bart, plus one minus times the previous (computed) point in the plot. The parameter (lambda) here should assume values greater than 0 and less than 1. Without going into detail (see Montgomery, 1985, p. 239), this method of averaging specifies that the weight of historically “old” sample means decreases geometrically as you continue to draw samples. The interpretation of this chart is much like that of the moving average chart, and it allows us to detect small shifts in the means, and, therefore, in the quality of the production process.

Regression Control Charts. Sometimes we want to monitor the relationship between two aspects of our production process. For example, a post office may want to monitor the number of worker-hours that are spent to process a certain amount of mail. These two variables should roughly be linearly correlated with each other, and the relationship can probably be described in terms of the well-known Pearson product-moment correlation coefficient r. This statistic is also described in Basic Statistics. The regression control chart contains a regression line that summarizes the linear relationship between the two variables of interest. The individual data points are also shown in the same graph. Around the regression line we establish a confidence interval within which we would expect a certain proportion (e.g., 95%) of samples to fall. Outliers in this plot may indicate samples where, for some reason, the common relationship between the two variables of interest does not hold.

Applications. There are many useful applications for the regression control chart. For example, professional auditors may use this chart to identify retail outlets with a greater than expected number of cash transactions given the overall volume of sales, or grocery stores with a greater than expected number of coupons redeemed, given the total sales. In both instances, outliers in the regression control charts (e.g., too many cash transactions; too many coupons redeemed) may deserve closer scrutiny.

Pareto Chart Analysis. Quality problems are rarely spread evenly across the different aspects of the production process or different plants. Rather, a few “bad apples” often account for the majority of problems. This principle has come to be known as the Pareto principle, which basically states that quality losses are mal-distributed in such a way that a small percentage of possible causes are responsible for the majority of the quality problems. For example, a relatively small number of “dirty” cars are probably responsible for the majority of air pollution; the majority of losses in most companies result from the failure of only one or two products. To illustrate the “bad apples”, one plots the Pareto chart,

which simply amounts to a histogram showing the distribution of the quality loss (e.g., dollar loss) across some meaningful categories; usually, the categories are sorted into descending order of importance (frequency, dollar amounts, etc.). Very often, this chart provides useful guidance as to where to direct quality improvement efforts.

Process Analysis

Sampling plans are discussed in detail in Duncan (1974) and Montgomery (1985); most process capability procedures (and indices) were only recently introduced to the US from Japan (Kane, 1986), however, they are discussed in three excellent recent hands-on books by Bhote (1988), Hart and Hart (1989), and Pyzdek (1989); detailed discussions of these methods can also be found in Montgomery (1991).

Step-by-step instructions for the computation and interpretation of capability indices are also provided in the Fundamental Statistical Process Control Reference Manual published by the ASQC (American Society for Quality Control) and AIAG (Automotive Industry Action Group, 1991; referenced as ASQC/AIAG, 1991). Repeatability and reproducibility (R & R) methods are discussed in Grant and Leavenworth (1980), Pyzdek (1989) and Montgomery (1991); a more detailed discussion of the subject (of variance estimation) is also provided in Duncan (1974).

Step-by-step instructions on how to conduct and analyze R & R experiments are presented in the Measurement Systems Analysis Reference Manual published by ASQC/AIAG (1990). In the following topics, we will briefly introduce the purpose and logic of each of these procedures. For more information on analyzing designs with random effects and for estimating components of variance, see Variance Components.

Sampling Plans

General Purpose

A common question that quality control engineers face is to determine how many items from a batch (e.g., shipment from a supplier) to inspect in order to ensure that the items (products) in that batch are of acceptable quality. For example, suppose we have a supplier of piston rings for small automotive engines that our company produces, and our goal is to establish a sampling procedure (of piston rings from the delivered batches) that ensures a specified quality. In principle, this problem is similar to that of on-line quality control discussed in Quality Control. In fact, you may want to read that section at this point to familiarize yourself with the issues involved in industrial statistical quality control.

Acceptance sampling. The procedures described here are useful whenever we need to decide whether or not a batch or lot of items complies with specifications, without having to inspect 100% of the items in the batch. Because of the nature of the problem – whether to accept a batch – these methods are also sometimes discussed under the heading of acceptance sampling.

Advantages over 100% inspection. An obvious advantage of acceptance sampling over 100% inspection of the batch or lot is that reviewing only a sample requires less time, effort, and money. In some cases, inspection of an item is destructive (e.g., stress testing of steel), and testing 100% would destroy the entire batch. Finally, from a managerial standpoint, rejecting an entire batch or shipment (based on acceptance sampling) from a supplier, rather than just a certain percent of defective items (based on 100% inspection) often provides a stronger incentive to the supplier to adhere to quality standards.

Computational Approach

In principle, the computational approach to the question of how large a sample to take is straightforward. Elementary Concepts discusses the concept of the sampling distribution. Briefly, if we were to take repeated samples of a particular size from a population of, for example, piston rings and compute their average diameters, then the distribution of those averages (means) would approach the normal distribution with a particular mean and standard deviation (or standard error; in sampling distributions the term standard error is preferred, in order to distinguish the variability of the means from the variability of the items in the population). Fortunately, we do not need to take repeated samples from the population in order to estimate the location (mean) and variability (standard error) of the sampling distribution. If we have a good idea (estimate) of what the variability (standard deviation or sigma) is in the population, then we can infer the sampling distribution of the mean. In principle, this information is sufficient to estimate the sample size that is needed in order to detect a certain change in quality (from target specifications). Without going into the details about the computational procedures involved, let us next review the particular information that the engineer must supply in order to estimate required sample sizes.

Means for H0 and H1

To formalize the inspection process of, for example, a shipment of piston rings, we can formulate two alternative hypotheses: First, we may hypothesize that the average piston ring diameters comply with specifications. This hypothesis is called the null hypothesis (H0). The second and alternative hypothesis (H1) is that the diameters of the piston rings delivered to us deviate from specifications by more than a certain amount. Note that we may specify these types of hypotheses not just for measurable variables such as diameters of piston rings, but also for attributes. For example, we may hypothesize (H1) that the number of defective parts in the batch exceeds a certain percentage. Intuitively, it should be clear that the larger the difference between H0 and H1, the smaller the sample necessary to detect this difference (see Elementary Concepts).

Alpha and Beta Error Probabilities

To return to the piston rings example, there are two types of mistakes that we can make when inspecting a batch of piston rings that has just arrived at our plant. First, we may erroneously reject H0, that is, reject the batch because we erroneously conclude that the piston ring diameters deviate from target specifications. The probability of committing this mistake is usually called the alpha error probability. The second mistake that we can make is to erroneously not reject H0 (accept the shipment of piston rings), when, in fact, the mean piston ring diameter deviates from the target specification by a certain amount. The probability of committing this mistake is usually called the beta error probability. Intuitively, the more certain we want to be, that is, the lower we set the alpha and beta error probabilities, the larger the sample will have to be; in fact, in order to be 100% certain, we would have to measure every single piston ring delivered to our company.

Fixed Sampling Plans

To construct a simple sampling plan, we would first decide on a sample size, based on the means under H0/H1 and the particular alpha and beta error probabilities. Then, we would take a single sample of this fixed size and, based on the mean in this sample, decide whether to accept or reject the batch. This procedure is referred to as a fixed sampling plan.

Operating characteristic (OC) curve. The power of the fixed sampling plan can be summarized via the operating characteristic curve. In that plot, the probability of rejecting H0 (and accepting H1) is plotted on the Y axis, as a function of an actual shift from the target (nominal) specification to the respective values shown on the X axis of the plot (see example below). This probability is, of course, one minus the beta error probability of erroneously rejecting H1 and accepting H0; this value is referred to as the power of the fixed sampling plan to detect deviations. Also indicated in this plot are the power functions for smaller sample sizes.

Sequential Sampling Plans

As an alternative to the fixed sampling plan, we could randomly choose individual piston rings and record their deviations from specification. As we continue to measure each piston ring, we could keep a running total of the sum of deviations from specification. Intuitively, if H1 is true, that is, if the average piston ring diameter in the batch is not on target, then we would expect to observe a slowly increasing or decreasing cumulative sum of deviations, depending on whether the average diameter in the batch is larger or smaller than the specification, respectively. It turns out that this kind of sequential sampling of individual items from the batch is a more sensitive procedure than taking a fixed sample. In practice, we continue sampling until we either accept or reject the batch.

Using a sequential sampling plan. Typically, we would produce a graph in which the cumulative deviations from specification (plotted on the Y-axis) are shown for successively sampled items (e.g., piston rings, plotted on the X-axis). Then two sets of lines are drawn in this graph to denote the “corridor” along which we will continue to draw samples, that is, as long as the cumulative sum of deviations from specifications stays within this corridor, we continue sampling.

If the cumulative sum of deviations steps outside the corridor we stop sampling. If the cumulative sum moves above the upper line or below the lowest line, we reject the batch. If the cumulative sum steps out of the corridor to the inside, that is, if it moves closer to the center line, we accept the batch (since this indicates zero deviation from specification). Note that the inside area starts only at a certain sample number; this indicates the minimum number of samples necessary to accept the batch (with the current error probability).


To summarize, the idea of (acceptance) sampling is to use statistical “inference” to accept or reject an entire batch of items, based on the inspection of only relatively few items from that batch. The advantage of applying statistical reasoning to this decision is that we can be explicit about the probabilities of making a wrong decision.

Whenever possible, sequential sampling plans are preferable to fixed sampling plans because they are more powerful. In most cases, relative to the fixed sampling plan, using sequential plans requires fewer items to be inspected in order to arrive at a decision with the same degree of certainty.

Process (Machine) Capability Analysis

Introductory Overview

See also, Non-Normal Distributions.

Quality Control describes numerous methods for monitoring the quality of a production process. However, once a process is under control the question arises, “to what extent does the long-term performance of the process comply with engineering requirements or managerial goals?” For example, to return to our piston ring example, how many of the piston rings that we are using fall within the design specification limits? In more general terms, the question is, “how capable is our process (or supplier) in terms of producing items within the specification limits?” Most of the procedures and indices described here were only recently introduced to the US by Ford Motor Company (Kane, 1986). They allow us to summarize the process capability in terms of meaningful percentages and indices.

In this topic, the computation and interpretation of process capability indices will first be discussed for the normal distribution case. If the distribution of the quality characteristic of interest does not follow the normal distribution, modified capability indices can be computed based on the percentiles of a fitted non-normal distribution.

Order of business. Note that it makes little sense to examine the process capability if the process is not in control. If the means of successively taken samples fluctuate widely, or are clearly off the target specification, then those quality problems should be addressed first. Therefore, the first step towards a high-quality process is to bring the process under control, using the charting techniques available in Quality Control.

Computational Approach

Once a process is in control, we can ask the question concerning the process capability. Again, the approach to answering this question is based on “statistical” reasoning, and is actually quite similar to that presented earlier in the context of sampling plans. To return to the piston ring example, given a sample of a particular size, we can estimate the standard deviation of the process, that is, the resultant ring diameters. We can then draw a histogram of the distribution of the piston ring diameters. As we discussed earlier, if the distribution of the diameters is normal, then we can make inferences concerning the proportion of piston rings within specification limits.

(For non-normal distributions, see Percentile Method. Let us now review some of the major indices that are commonly used to describe process capability.

Capability Analysis – Process Capability Indices

Process range. First, it is customary to establish the ± 3 sigma limits around the nominal specifications. Actually, the sigma limits should be the same as the ones used to bring the process under control using Shewhart control charts (see Quality Control). These limits denote the range of the process (i.e., process range). If we use the ± 3 sigma limits then, based on the normal distribution, we can estimate that approximately 99% of all piston rings fall within these limits.

Specification limits LSL, USL. Usually, engineering requirements dictate a range of acceptable values. In our example, it may have been determined that acceptable values for the piston ring diameters would be 74.0 ± .02 millimeters. Thus, the lower specification limit (LSL) for our process is 74.0 – 0.02 = 73.98; the upper specification limit (USL) is 74.0 + 0.02 = 74.02. The difference between USL and LSL is called the specification range.

Potential capability (Cp). This is the simplest and most straightforward indicator of process capability. It is defined as the ratio of the specification range to the process range; using ± 3 sigma limits we can express this index as:

Cp = (USL-LSL)/(6*Sigma)

Put into words, this ratio expresses the proportion of the range of the normal curve that falls within the engineering specification limits (provided that the mean is on target, that is, that the process is centered, see below).

Bhote (1988) reports that prior to the widespread use of statistical quality control techniques (prior to 1980), the normal quality of US manufacturing processes was approximately Cp = .67. This means that the two 33/2 percent tail areas of the normal curve fall outside specification limits. As of 1988, only about 30% of US processes are at or below this level of quality (see Bhote, 1988, p. 51). Ideally, of course, we would like this index to be greater than 1, that is, we would like to achieve a process capability so that no (or almost no) items fall outside specification limits. Interestingly, in the early 1980’s the Japanese manufacturing industry adopted as their standard Cp = 1.33! The process capability required to manufacture high-tech products is usually even higher than this; Minolta has established a Cp index of 2.0 as their minimum standard (Bhote, 1988, p. 53), and as the standard for its suppliers. Note that high process capability usually implies lower, not higher costs, taking into account the costs due to poor quality. We will return to this point shortly.

Capability ratio (Cr). This index is equivalent to Cp; specifically, it is computed as 1/Cp (the inverse of Cp).

Lower/upper potential capability: Cpl, Cpu. A major shortcoming of the Cp (and Cr) index is that it may yield erroneous information if the process is not on target, that is, if it is not centered. We can express non-centering via the following quantities. First, upper and lower potential capability indices can be computed to reflect the deviation of the observed process mean from the LSL and USL.. Assuming ± 3 sigma limits as the process range, we compute:

Cpl = (Mean – LSL)/3*Sigma
Cpu = (USL – Mean)/3*Sigma

Obviously, if these values are not identical to each other, then the process is not centered.

Non-centering correction (K). We can correct Cp for the effects of non-centering. Specifically, we can compute:

K=abs(D – Mean)/(1/2*(USL – LSL))


D = (USL+LSL)/2.

This correction factor expresses the non-centering (target specification minus mean) relative to the specification range.

Demonstrated excellence (Cpk). Finally, we can adjust Cp for the effect of non-centering by computing:

Cpk = (1-k)*Cp

If the process is perfectly centered, then k is equal to zero, and Cpk is equal to Cp. However, as the process drifts from the target specification, k increases and Cpk becomes smaller than Cp.

Potential Capability II: Cpm. A recent modification (Chan, Cheng, & Spiring, 1988) to Cp is directed at adjusting the estimate of sigma for the effect of (random) non-centering. Specifically, we may compute the alternative sigma (Sigma2) as:

Sigma2 = { (xi – TS)2/(n-1)}½

Sigma2 is the alternative estimate of sigma
xi          is the value of the i‘th observation in the sample
TS        is the target or nominal specification
n           is the number of observations in the sample

We may then use this alternative estimate of sigma to compute Cp as before; however, we will refer to the resultant index as Cpm.

Process Performance vs. Process Capability

When monitoring a process via a quality control chart (e.g., the X-bar and R-chart; Quality Control) it is often useful to compute the capability indices for the process. Specifically, when the data set consists of multiple samples, such as data collected for the quality control chart, then one can compute two different indices of variability in the data. One is the regular standard deviation for all observations, ignoring the fact that the data consist of multiple samples; the other is to estimate the process’s inherent variation from the within-sample variability. For example, when plotting X-bar and R-charts one may use the common estimator R-bar/d2 for the process sigma (e.g., see Duncan, 1974; Montgomery, 1985, 1991). Note however, that this estimator is only valid if the process is statistically stable. For a detailed discussion of the difference between the total process variation and the inherent variation refer to ASQC/AIAG reference manual (ASQC/AIAG, 1991, page 80).

When the total process variability is used in the standard capability computations, the resulting indices are usually referred to as process performance indices (as they describe the actual performance of the process), while indices computed from the inherent variation (within-sample sigma) are referred to as capability indices (since they describe the inherent capability of the process).

Using Experiments to Improve Process Capability

As mentioned before, the higher the Cp index, the better the process – and there is virtually no upper limit to this relationship. The issue of quality costs, that is, the losses due to poor quality, is discussed in detail in the context of Taguchi robust design methods (see Experimental Design). In general, higher quality usually results in lower costs overall; even though the costs of production may increase, the losses due to poor quality, for example, due to customer complaints, loss of market share, etc. are usually much greater. In practice, two or three well-designed experiments carried out over a few weeks can often achieve a Cp of 5 or higher. If you are not familiar with the use of designed experiments, but are concerned with the quality of a process, we strongly recommend that you review the methods detailed in Experimental Design.

Testing the Normality Assumption

The indices we have just reviewed are only meaningful if, in fact, the quality characteristic that is being measured is normally distributed. A specific test of the normality assumption (Kolmogorov-Smirnov and Chi-square test of goodness-of-fit) is available; these tests are described in most statistics textbooks, and they are also discussed in greater detail in Nonparametrics and Distribution Fitting.

A visual check for normality is to examine the probability-probability and quantile-quantile plots for the normal distribution. For more information, see Process Analysis and Non-Normal Distributions.

Tolerance Limits

Before the introduction of process capability indices in the early 1980’s, the common method for estimating the characteristics of a production process was to estimate and examine the tolerance limits of the process (see, for example, Hald, 1952). The logic of this procedure is as follows. Let us assume that the respective quality characteristic is normally distributed in the population of items produced; we can then estimate the lower and upper interval limits that will ensure with a certain level of confidence (probability) that a certain percent of the population is included in those limits. Put another way, given:

  1. a specific sample size (n),
  2. the process mean,
  3. the process standard deviation (sigma),
  4. a confidence level, and
  5. the percent of the population that we want to be included in the interval,

we can compute the corresponding tolerance limits that will satisfy all these parameters. You can also compute parameter-free tolerance limits that are not based on the assumption of normality (Scheffe & Tukey, 1944, p. 217; Wilks, 1946, p. 93; see also Duncan, 1974, or Montgomery, 1985, 1991).

See also, Non-Normal Distributions.

Gage Repeatability and Reproducibility

Introductory Overview

Gage repeatability and reproducibility analysis addresses the issue of precision of measurement. The purpose of repeatability and reproducibility experiments is to determine the proportion of measurement variability that is due to (1) the items or parts being measured (part-to-part variation), (2) the operator or appraiser of the gages (reproducibility), and (3) errors (unreliabilities) in the measurements over several trials by the same operators of the same parts (repeatability). In the ideal case, all variability in measurements will be due to the part-to-part variation, and only a negligible proportion of the variability will be due to operator reproducibility and trial-to-trial repeatability.

To return to the piston ring example , if we require detection of deviations from target specifications of the magnitude of .01 millimeters, then we obviously need to use gages of sufficient precision. The procedures described here allow the engineer to evaluate the precision of gages and different operators (users) of those gages, relative to the variability of the items in the population.

You can compute the standard indices of repeatability, reproducibility, and part-to-part variation, based either on ranges (as is still common in these types of experiments) or from the analysis of variance (ANOVA) table (as, for example, recommended in ASQC/AIAG, 1990, page 65). The ANOVA table will also contain an F test (statistical significance test) for the operator-by-part interaction, and report the estimated variances, standard deviations, and confidence intervals for the components of the ANOVA model.

Finally, you can compute the respective percentages of total variation, and report so-called percent-of-tolerance statistics. These measures are briefly discussed in the following sections of this introduction. Additional information can be found in Duncan (1974), Montgomery (1991), or the DataMyte Handbook (1992); step-by-step instructions and examples are also presented in the ASQC/AIAG Measurement systems analysis reference manual (1990) and the ASQC/AIAG Fundamental statistical process control reference manual (1991).

Note that there are several other statistical procedures which may be used to analyze these types of designs; see the section on Methods for Analysis of Variance for details. In particular the methods discussed in the Variance Components and Mixed Model ANOVA/ANCOVA chapter are very efficient for analyzing very large nested designs (e.g., with more than 200 levels overall), or hierarchically nested designs (with or without random factors).

Computational Approach

One may think of each measurement as consisting of the following components:

  1. a component due to the characteristics of the part or item being measured,
  2. a component due to the reliability of the gage, and
  3. a component due to the characteristics of the operator (user) of the gage.

The method of measurement (measurement system) is reproducible if different users of the gage come up with identical or very similar measurements. A measurement method is repeatable if repeated measurements of the same part produces identical results. Both of these characteristics – repeatability and reproducibility – will affect the precision of the measurement system.

We can design an experiment to estimate the magnitudes of each component, that is, the repeatability, reproducibility, and the variability between parts, and thus assess the precision of the measurement system. In essence, this procedure amounts to an analysis of variance (ANOVA) on an experimental design which includes as factors different parts, operators, and repeated measurements (trials). We can then estimate the corresponding variance components (the term was first used by Daniels, 1939) to assess the repeatability (variance due to differences across trials), reproducibility (variance due to differences across operators), and variability between parts (variance due to differences across parts). If you are not familiar with the general idea of ANOVA, you may want to refer to ANOVA/MANOVA. In fact, the extensive features provided there can also be used to analyze repeatability and reproducibility studies.

Plots of Repeatability and Reproducibility

There are several ways to summarize via graphs the findings from a repeatability and reproducibility experiment. For example, suppose we are manufacturing small kilns that are used for drying materials for other industrial production processes. The kilns should operate at a target temperature of around 100 degrees Celsius. In this study, 5 different engineers (operators) measured the same sample of 8 kilns (parts), three times each (three trials). We can plot the mean ratings of the 8 parts by operator. If the measurement system is reproducible, then the pattern of means across parts should be quite consistent across the 5 engineers who participated in the study.

R and S charts. Quality Control discusses in detail the idea of R (range) and S (sigma) plots for controlling process variability. We can apply those ideas here and produce a plot of ranges (or sigmas) by operators or by parts; these plots will allow us to identify outliers among operators or parts. If one operator produced particularly wide ranges of measurements, we may want to find out why that particular person had problems producing reliable measurements (e.g., perhaps he or she failed to understand the instructions for using the measurement gage).

Analogously, producing an R chart by parts may allow us to identify parts that are particularly difficult to measure reliably; again, inspecting that particular part may give us some insights into the weaknesses in our measurement system.

Repeatability and reproducibility summary plot. The summary plot shows the individual measurements by each operator; specifically, the measurements are shown in terms of deviations from the respective average rating for the respective part. Each trial is represented by a point, and the different measurement trials for each operator for each part are connected by a vertical line. Boxes drawn around the measurements give us a general idea of a particular operator’s bias (see graph below).

Components of Variance

(see also Variance Components)

Percent of Process Variation and Tolerance. The Percent Tolerance allows you to evaluate the performance of the measurement system with regard to the overall process variation, and the respective tolerance range. You can specify the tolerance range (Total tolerance for parts) and the Number of sigma intervals. The latter value is used in the computations to define the range (spread) of the respective (repeatability, reproducibility, part-to-part, etc.) variability. Specifically, the default value (5.15) defines 5.15 times the respective sigma estimate as the respective range of values; if the data are normally distributed, then this range defines 99% of the space under the normal curve, that is, the range that will include 99% of all values (or reproducibility/repeatability errors) due to the respective source of variation.

Percent of process variation. This value reports the variability due to different sources relative to the total variability (range) in the measurements.

Analysis of Variance. Rather than computing variance components estimates based on ranges, an accurate method for computing these estimates is based on the ANOVA mean squares (see Duncan, 1974, ASQC/AIAG, 1990 ).

One may treat the three factors in the R & R experiment (Operator, Parts, Trials) as random factors in a three-way ANOVA model (see also General ANOVA/MANOVA). For details concerning the different models that are typically considered, refer to ASQC/AIAG (1990, pages 92-95), or to Duncan (1974, pages 716-734). Customarily, it is assumed that all interaction effects by the trial factor are non-significant. This assumption seems reasonable, since, for example, it is difficult to imagine how the measurement of some parts will be systematically different in successive trials, in particular when parts and trials are randomized.

However, the Operator by Parts interaction may be important. For example, it is conceivable that certain less experienced operators will be more prone to particular biases, and hence will arrive at systematically different measurements for particular parts. If so, then one would expect a significant two-way interaction (again, refer to General ANOVA/MANOVA if you are not familiar with ANOVA terminology).

In the case when the two-way interaction is statistically significant, then one can separately estimate the variance components due to operator variability, and due to the operator by parts variability

In the case of significant interactions, the combined repeatability and reproducibility variability is defined as the sum of three components: repeatability (gage error), operator variability, and the operator-by-part variability.

If the Operator by Part interaction is not statistically significant a simpler additive model can be used without interactions.


To summarize, the purpose of the repeatability and reproducibility procedures is to allow the quality control engineer to assess the precision of the measurement system (gages) used in the quality control process. Obviously, if the measurement system is not repeatable (large variability across trials) or reproducible (large variability across operators) relative to the variability between parts, then the measurement system is not sufficiently precise to be used in the quality control efforts. For example, it should not be used in charts produced via Quality Control, or product capability analyses and acceptance sampling procedures via Process Analysis.

Non-Normal Distributions

Introductory Overview

General Purpose. The concept of process capability is described in detail in the Process Capability Overview. To reiterate, when judging the quality of a (e.g., production) process it is useful to estimate the proportion of items produced that fall outside a predefined acceptable specification range. For example, the so-called Cp index is computed as:

Cp – (USL-LSL)/(6*sigma)

where sigma is the estimated process standard deviation, and USL and LSL are the upper and lower specification limits, respectively. If the distribution of the respective quality characteristic or variable (e.g., size of piston rings) is normal, and the process is perfectly centered (i.e., the mean is equal to the design center), then this index can be interpreted as the proportion of the range of the standard normal curve (the process width) that falls within the engineering specification limits. If the process is not centered, an adjusted index Cpk is used instead.

Non-Normal Distributions. You can fit non-normal distributions to the observed histogram, and compute capability indices based on the respective fitted non-normal distribution (via the percentile method). In addition, instead of computing capability indices by fitting specific distributions, you can compute capability indices based on two different general families of distributions: the Johnson distributions (Johnson, 1965; see also Hahn and Shapiro, 1967) and Pearson distributions (Johnson, Nixon, Amos, and Pearson, 1963; Gruska, Mirkhani, and Lamberson, 1989; Pearson and Hartley, 1972), which allow us to approximate a wide variety of continuous distributions. For all distributions, we can also compute the table of expected frequencies, the expected number of observations beyond specifications, and quantile-quantile and probability-probability plots. The specific method for computing process capability indices from these distributions is described in Clements (1989).

Quantile-quantile plots and probability-probability plots. There are various methods for assessing the quality of respective fit to the observed data. In addition to the table of observed and expected frequencies for different intervals, and the Kolmogorov-Smirnov and Chi-square goodness-of-fit tests, you can compute quantile and probability plots for all distributions. These scatterplots are constructed so that if the observed values follow the respective distribution, then the points will form a straight line in the plot. These plots are described further below.

Fitting Distributions by Moments

In addition to the specific continuous distributions described above, you can fit general “families” of distributions – the so-called Johnson and Pearson curves – with the goal to match the first four moments of the observed distribution.

General approach. The shapes of most continuous distributions can be sufficiently summarized in the first four moments. Put another way, if one fits to a histogram of observed data a distribution that has the same mean (first moment), variance (second moment), skewness (third moment) and kurtosis (fourth moment) as the observed data, then one can usually approximate the overall shape of the distribution very well. Once a distribution has been fitted, one can then calculate the expected percentile values under the (standardized) fitted curve, and estimate the proportion of items produced by the process that fall within the specification limits.

Johnson curves. Johnson (1949) described a system of frequency curves that represents transformations of the standard normal curve (see Hahn and Shapiro, 1967, for details). By applying these transformations to a standard normal variable, a wide variety of non-normal distributions can be approximated, including distributions which are bounded on either one or both sides (e.g., U-shaped distributions). The advantage of this approach is that once a particular Johnson curve has been fit, the normal integral can be used to compute the expected percentage points under the respective curve. Methods for fitting Johnson curves, so as to approximate the first four moments of an empirical distribution, are described in detail in Hahn and Shapiro, 1967, pages 199-220; and Hill, Hill, and Holder, 1976.

Pearson curves. Another system of distributions was proposed by Karl Pearson (e.g., see Hahn and Shapiro, 1967, pages 220-224). The system consists of seven solutions (of 12 originally enumerated by Pearson) to a differential equation which also approximate a wide range of distributions of different shapes. Gruska, Mirkhani, and Lamberson (1989) describe in detail how the different Pearson curves can be fit to an empirical distribution. A method for computing specific Pearson percentiles is also described in Davis and Stephens (1983).

Assessing the Fit: Quantile and Probability Plots

For each distribution, you can compute the table of expected and observed frequencies and the respective Chi-square goodness-of-fit test, as well as the Kolmogorov-Smirnov d test. However, the best way to assess the quality of the fit of a theoretical distribution to an observed distribution is to review the plot of the observed distribution against the theoretical fitted distribution. There are two standard types of plots used for this purpose: Quantile-quantile plots and probability-probability plots.

Quantile-quantile plots. In quantile-quantile plots (or Q-Q plots for short), the observed values of a variable are plotted against the theoretical quantiles. To produce a Q-Q plot, you first sort the n observed data points into ascending order, so that:

x1 x2 xn

These observed values are plotted against one axis of the graph; on the other axis the plot will show:

F-1 ((i-radj)/(n+nadj))

where i is the rank of the respective observation, radj and nadj are adjustment factors ( 0.5) and F-1 denotes the inverse of the probability integral for the respective standardized distribution. The resulting plot (see example below) is a scatterplot of the observed values against the (standardized) expected values, given the respective distribution. Note that, in addition to the inverse probability integral value, you can also show the respective cumulative probability values on the opposite axis, that is, the plot will show not only the standardized values for the theoretical distribution, but also the respective p-values.

A good fit of the theoretical distribution to the observed values would be indicated by this plot if the plotted values fall onto a straight line. Note that the adjustment factors radj and nadj ensure that the p-value for the inverse probability integral will fall between 0 and 1, but not including 0 and 1 (see Chambers, Cleveland, Kleiner, and Tukey, 1983).

Probability-probability plots. In probability-probability plots (or P-P plots for short) the observed cumulative distribution function is plotted against the theoretical cumulative distribution function. As in the Q-Q plot, the values of the respective variable are first sorted into ascending order. The i‘th observation is plotted against one axis as i/n (i.e., the observed cumulative distribution function), and against the other axis as F(x(i)), where F(x(i)) stands for the value of the theoretical cumulative distribution function for the respective observation x(i). If the theoretical cumulative distribution approximates the observed distribution well, then all points in this plot should fall onto the diagonal line (as in the graph below).

Non-Normal Process Capability Indices (Percentile Method)

As described earlier, process capability indices are generally computed to evaluate the quality of a process, that is, to estimate the relative range of the items manufactured by the process (process width) with regard to the engineering specifications. For the standard, normal-distribution-based, process capability indices, the process width is typically defined as 6 times sigma, that is, as plus/minus 3 times the estimated process standard deviation. For the standard normal curve, these limits (zl = -3 and zu = +3) translate into the 0.135 percentile and 99.865 percentile, respectively. In the non-normal case, the 3 times sigma limits as well as the mean (zM = 0.0) can be replaced by the corresponding standard values, given the same percentiles, under the non-normal curve. This procedure is described in detail by Clements (1989).

Process capability indices. Shown below are the formulas for the non-normal process capability indices:

Cp = (USL-LSL)/(Up-Lp)

CpL = (M-LSL)/(M-Lp)

CpU = (USL-M)/(Up-M)

Cpk = Min(CpU, CpL)

In these equations, M represents the 50’th percentile value for the respective fitted distribution, and Up and Lp are the 99.865 and .135 percentile values, respectively, if the computations are based on a process width of ±3 times sigma. Note that the values for Up and Lp may be different, if the process width is defined by different sigma limits (e.g., ±2 times sigma).


Weibull and Reliability/Failure Time Analysis

A key aspect of product quality is product reliability. A number of specialized techniques have been developed to quantify reliability and to estimate the “life expectancy” of a product. Standard references and textbooks describing these techniques include Lawless (1982), Nelson (1990), Lee (1980, 1992), and Dodson (1994); the relevant functions of the Weibull distribution (hazard, CDF, reliability) are also described in the Weibull CDF, reliability, and hazard functions section. Note that very similar statistical procedures are used in the analysis of survival data (see also the description of Survival Analysis), and, for example, the descriptions in Lee’s book (Lee, 1992) are primarily addressed to biomedical research applications. An excellent overview with many examples of engineering applications is provided by Dodson (1994).

General Purpose

The reliability of a product or component constitutes an important aspect of product quality. Of particular interest is the quantification of a product’s reliability, so that one can derive estimates of the product’s expected useful life. For example, suppose you are flying a small single engine aircraft. It would be very useful (in fact vital) information to know what the probability of engine failure is at different stages of the engine’s “life” (e.g., after 500 hours of operation, 1000 hours of operation, etc.). Given a good estimate of the engine’s reliability, and the confidence limits of this estimate, one can then make a rational decision about when to swap or overhaul the engine.

The Weibull Distribution

A useful general distribution for describing failure time data is the Weibull distribution (see also Weibull CDF, reliability, and hazard functions). The distribution is named after the Swedish professor Waloddi Weibull, who demonstrated the appropriateness of this distribution for modeling a wide variety of different data sets (see also Hahn and Shapiro, 1967; for example, the Weibull distribution has been used to model the life times of electronic components, relays, ball bearings, or even some businesses).

Hazard function and the bathtub curve. It is often meaningful to consider the function that describes the probability of failure during a very small time increment (assuming that no failures have occurred prior to that time). This function is called the hazard function (or, sometimes, also the conditional failure, intensity, or force of mortality function), and is generally defined as:

h(t) = f(t)/(1-F(t))

where h(t) stands for the hazard function (of time t), and f(t) and F(t) are the probability density and cumulative distribution functions, respectively. The hazard (conditional failure) function for most machines (components, devices) can best be described in terms of the “bathtub” curve: Very early during the life of a machine, the rate of failure is relatively high (so-called Infant Mortality Failures); after all components settle, and the electronic parts are burned in, the failure rate is relatively constant and low. Then, after some time of operation, the failure rate again begins to increase (so-called Wear-out Failures), until all components or devices will have failed.

For example, new automobiles often suffer several small failures right after they were purchased. Once these have been “ironed out,” a (hopefully) long relatively trouble-free period of operation will follow. Then, as the car reaches a particular age, it becomes more prone to breakdowns, until finally, after 20 years and 250000 miles, practically all cars will have failed. A typical bathtub hazard function is shown below.

The Weibull distribution is flexible enough for modeling the key stages of this typical bathtub-shaped hazard function. Shown below are the hazard functions for shape parameters c=.5, c=1, c=2, and c=5.

Clearly, the early (“infant mortality”) “phase” of the bathtub can be approximated by a Weibull hazard function with shape parameter c<1; the constant hazard phase of the bathtub can be modeled with a shape parameter c=1, and the final (“wear-out”) stage of the bathtub with c>1.

Cumulative distribution and reliability functions. Once a Weibull distribution (with a particular set of parameters) has been fit to the data, a number of additional important indices and measures can be estimated. For example, you can compute the cumulative distribution function (commonly denoted as F(t)) for the fitted distribution, along with the standard errors for this function. Thus, you can determine the percentiles of the cumulative survival (and failure) distribution, and, for example, predict the time at which a predetermined percentage of components can be expected to have failed.

The reliability function (commonly denoted as R(t)) is the complement to the cumulative distribution function (i.e., R(t)=1-F(t)); the reliability function is also sometimes referred to as the survivorship or survival function (since it describes the probability of not failing or of surviving until a certain time t; e.g., see Lee, 1992). Shown below is the reliability function for the Weibull distribution, for different shape parameters.

For shape parameters less than 1, the reliability decreases sharply very early in the respective product’s life, and then slowly thereafter. For shape parameters greater than 1, the initial drop in reliability is small, and then the reliability drops relatively sharply at some point later in time. The point where all curves intersect is called the characteristic life: regardless of the shape parameter, 63.2 percent of the population will have failed at or before this point (i.e., R(t) = 1-0.632 = .368). This point in time is also equal to the respective scale parameter b of the two-parameter Weibull distribution (with = 0; otherwise it is equal to b+).

The formulas for the Weibull cumulative distribution, reliability, and hazard functions are shown in the Weibull CDF, reliability, and hazard functions section.

Censored Observations

In most studies of product reliability, not all items in the study will fail. In other words, by the end of the study the researcher only knows that a certain number of items have not failed for a particular amount of time, but has no knowledge of the exact failure times (i.e., “when the items would have failed”). Those types of data are called censored observations. The issue of censoring, and several methods for analyzing censored data sets, are also described in great detail in the context of Survival Analysis. Censoring can occur in many different ways.

Type I and II censoring. So-called Type I censoring describes the situation when a test is terminated at a particular point in time, so that the remaining items are only known not to have failed up to that time (e.g., we start with 100 light bulbs, and terminate the experiment after a certain amount of time). In this case, the censoring time is often fixed, and the number of items failing is a random variable. In Type II censoring the experiment would be continued until a fixed proportion of items have failed (e.g., we stop the experiment after exactly 50 light bulbs have failed). In this case, the number of items failing is fixed, and time is the random variable.

Left and right censoring. An additional distinction can be made to reflect the “side” of the time dimension at which censoring occurs. In the examples described above, the censoring always occurred on the right side (right censoring), because the researcher knows when exactly the experiment started, and the censoring always occurs on the right side of the time continuum. Alternatively, it is conceivable that the censoring occurs on the left side (left censoring). For example, in biomedical research one may know that a patient entered the hospital at a particular date, and that s/he survived for a certain amount of time thereafter; however, the researcher does not know when exactly the symptoms of the disease first occurred or were diagnosed.

Single and multiple censoring. Finally, there are situations in which censoring can occur at different times (multiple censoring), or only at a particular point in time (single censoring). To return to the light bulb example, if the experiment is terminated at a particular point in time, then a single point of censoring exists, and the data set is said to be single-censored. However, in biomedical research multiple censoring often exists, for example, when patients are discharged from a hospital after different amounts (times) of treatment, and the researcher knows that the patient survived up to those (differential) points of censoring.

The methods described in this section are applicable primarily to right censoring, and single- as well as multiple-censored data.

Two- and Three-Parameter Weibull Distribution

The Weibull distribution is bounded on the left side. If you look at the probability density function, you can see that that the term x- must be greater than 0. In most cases, the location parameter (theta) is known (usually 0): it identifies the smallest possible failure time. However, sometimes the probability of failure of an item is 0 (zero) for some time after a study begins, and in that case it may be necessary to estimate a location parameter that is greater than 0. There are several methods for estimating the location parameter of the three-parameter Weibull distribution. To identify situations when the location parameter is greater than 0, Dodson (1994) recommends to look for downward of upward sloping tails on a probability plot (see below), as well as large (>6) values for the shape parameter after fitting the two-parameter Weibull distribution, which may indicate a non-zero location parameter.

Parameter Estimation

Maximum likelihood estimation. Standard iterative function minimization methods can be used to compute maximum likelihood parameter estimates for the two- and three-parameter Weibull distribution. The specific methods for estimating the parameters are described in Dodson (1994); a detailed description of a Newton-Raphson iterative method for estimating the maximum likelihood parameters for the two-parameter distribution is provided in Keats and Lawrence (1997).

The estimation of the location parameter for the three-parameter Weibull distribution poses a number of special problems, which are detailed in Lawless (1982). Specifically, when the shape parameter is less than 1, then a maximum likelihood solution does not exist for the parameters. In other instances, the likelihood function may contain more than one maximum (i.e., multiple local maxima). In the latter case, Lawless basically recommends using the smallest failure time (or a value that is a little bit less) as the estimate of the location parameter.

Nonparametric (rank-based) probability plots. One can derive a descriptive estimate of the cumulative distribution function (regardless of distribution) by first rank-ordering the observations, and then computing any of the following expressions:

Median rank:

F(t) = (j-0.3)/(n+0.4)

Mean rank:

F(t) = j/(n+1)

White’s plotting position:

F(t) = (j-3/8)/(n+1/4)

where j denotes the failure order (rank; for multiple-censored data a weighted average ordered failure is computed; see Dodson, p. 21), and n is the total number of observations. One can then construct the following plot.

Note that the horizontal Time axis is scaled logarithmically; on the vertical axis the quantity log(log(100/(100-F(t))) is plotted (a probability scale is shown on the left-y axis). From this plot the parameters of the two-parameter Weibull distribution can be estimated; specifically, the shape parameter is equal to the slope of the linear fit-line, and the scale parameter can be estimated as exp(-intercept/slope).

Estimating the location parameter from probability plots. It is apparent in the plot shown above that the regression line provides a good fit to the data. When the location parameter is misspecified (e.g., not equal to zero), then the linear fit is worse as compared to the case when it is appropriately specified. Therefore, one can compute the probability plot for several values of the location parameter, and observe the quality of the fit. These computations are summarized in the following plot.

Here the common R-square measure (correlation squared) is used to express the quality of the linear fit in the probability plot, for different values of the location parameter shown on the horizontal x axis (this plot is based on the example data set in Dodson, 1994, Table 2.9). This plot is often very useful when the maximum likelihood estimation procedure for the three-parameter Weibull distribution fails, because it shows whether or not a unique (single) optimum value for the location parameter exists (as in the plot shown above).

Hazard plotting. Another method for estimating the parameters for the two-parameter Weibull distribution is via hazard plotting (as discussed earlier, the hazard function describes the probability of failure during a very small time increment, assuming that no failures have occurred prior to that time). This method is very similar to the probability plotting method. First plot the cumulative hazard function against the logarithm of the survival times; then fit a linear regression line and compute the slope and intercept of that line. As in probability plotting, the shape parameter can then be estimated as the slope of the regression line, and the scale parameter as exp(-intercept/slope). See Dodson (1994) for additional details; see also Weibull CDF, reliability, and hazard functions.

Method of moments. This method – to approximate the moments of the observed distribution by choosing the appropriate parameters for the Weibull distribution – is also widely described in the literature. In fact, this general method is used for fitting the Johnson curves general non-normal distribution to the data, to compute non-normal process capability indices (see Fitting Distributions by Moments). However, the method is not suited for censored data sets, and is therefore not very useful for the analysis of failure time data.

Comparing the estimation methods. Dodson (1994) reports the result of a Monte Carlo simulation study, comparing the different methods of estimation. In general, the maximum likelihood estimates proved to be best for large sample sizes (e.g., n>15), while probability plotting and hazard plotting appeared to produce better (more accurate) estimates for smaller samples.

A note of caution regarding maximum likelihood based confidence limits. Many software programs will compute confidence intervals for maximum likelihood estimates, and for the reliability function based on the standard errors of the maximum likelihood estimates. Dodson (1994) cautions against the interpretation of confidence limits computed from maximum likelihood estimates, or more precisely, estimates that involve the information matrix for the estimated parameters. When the shape parameter is less than 2, the variance estimates computed for maximum likelihood estimates lack accuracy, and it is advisable to compute the various results graphs based on nonparametric confidence limits as well.

Goodness of Fit Indices

A number of different tests have been proposed for evaluating the quality of the fit of the Weibull distribution to the observed data. These tests are discussed and compared in detail in Lawless (1982).

Hollander-Proschan. This test compares the theoretical reliability function to the Kaplan-Meier estimate. The actual computations for this test are somewhat complex, and you may refer to Dodson (1994, Chapter 4) for a detailed description of the computational formulas. The Hollander-Proschan test is applicable to complete, single-censored, and multiple-censored data sets; however, Dodson (1994) cautions that the test may sometimes indicate a poor fit when the data are heavily single-censored. The Hollander-Proschan C statistic can be tested against the normal distribution (z).

Mann-Scheuer-Fertig. This test, proposed by Mann, Scheuer, and Fertig (1973), is described in detail in, for example, Dodson (1994) or Lawless (1982). The null hypothesis for this test is that the population follows the Weibull distribution with the estimated parameters. Nelson (1982) reports this test to have reasonably good power, and this test can be applied to Type II censored data. For computational details refer to Dodson (1994) or Lawless (1982); the critical values for the test statistic have been computed based on Monte Carlo studies, and have been tabulated for n (sample sizes) between 3 and 25.

Anderson-Darling. The Anderson-Darling procedure is a general test to compare the fit of an observed cumulative distribution function to an expected cumulative distribution function. However, this test is only applicable to complete data sets (without censored observations). The critical values for the Anderson-Darling statistic have been tabulated (see, for example, Dodson, 1994, Table 4.4) for sample sizes between 10 and 40; this test is not computed for n less than 10 and greater than 40.

Interpreting Results

Once a satisfactory fit of the Weibull distribution to the observed failure time data has been obtained, there are a number of different plots and tables that are of interest to understand the reliability of the item under investigation. If a good fit for the Weibull cannot be established, distribution-free reliability estimates (and graphs) should be reviewed to determine the shape of the reliability function.

Reliability plots. This plot will show the estimated reliability function along with the confidence limits.

Note that nonparametric (distribution-free) estimates and their standard errors can also be computed and plotted.

Hazard plots. As mentioned earlier, the hazard function describes the probability of failure during a very small time increment (assuming that no failures have occurred prior to that time). The plot of hazard as a function of time gives valuable information about the conditional failure probability.

Percentiles of the reliability function. Based on the fitted Weibull distribution, one can compute the percentiles of the reliability (survival) function, along with the confidence limits for these estimates (for maximum likelihood parameter estimates). These estimates are particularly valuable for determining the percentages of items that can be expected to have failed at particular points in time.

Grouped Data

In some cases, failure time data are presented in grouped form. Specifically, instead of having available the precise failure time for each observation, only aggregate information is available about the number of items that failed or were censored in a particular time interval. Such life-table data input is also described in the context of the Survival Analysis chapter. There are two general approaches for fitting the Weibull distribution to grouped data.

First, one can treat the tabulated data as if they were continuous. In other words, one can “expand” the tabulated values into continuous data by assuming (1) that each observation in a given time interval failed exactly at the interval mid-point (interpolating out “half a step” for the last interval), and (2) that censoring occurred after the failures in each interval (in other words, censored observations are sorted after the observed failures). Lawless (1982) advises that this method is usually satisfactory if the class intervals are relatively narrow.

Alternatively, you may treat the data explicitly as a tabulated life table, and use a weighted least squares methods algorithm (based on Gehan and Siddiqui, 1973; see also Lee, 1992) to fit the Weibull distribution (Lawless, 1982, also describes methods for computing maximum likelihood parameter estimates from grouped data).

Modified Failure Order for Multiple-Censored Data

For multiple-censored data a weighted average ordered failure is calculated for each failure after the first censored data point. These failure orders are then used to compute the median rank, to estimate the cumulative distribution function.

The modified failure order j is computed as (see Dodson 1994):

Ij = ((n+1)-Op)/(1+c)


Ij      is the increment for the j’th failure
n      is the total number of data points
Op   is the failure order of the previous observation (and Oj = Op + Ij)
c      is the number of data points remaining in the data set, including the current data point

The median rank is then computed as:

F(t) = (Ij -0.3)/(n+0.4)

where Ij denotes the modified failure order, and n is the total number of observations.

Weibull CDF, Reliability, and Hazard

Density function. The Weibull distribution (Weibull, 1939, 1951; see also Lieblein, 1955) has density function (for positive parameters b, c, and ):

f(x) = c/b*[(x-)/b]c-1 * e^{-[(x-)/b]c}
< x,  b > 0,  c > 0

b     is the scale parameter of the distribution
c     is the shape parameter of the distribution
   is the location parameter of the distribution
e     is the base of the natural logarithm, sometimes called Euler’s e (2.71…)

Cumulative distribution function (CDF). The Weibull distribution has the cumulative distribution function (for positive parameters b, c, and ):

F(x) = 1 – exp{-[(x-)/b]c}

using the same notation and symbols as described above for the density function.

Reliability function. The Weibull reliability function is the complement of the cumulative distribution function:

R(x) = 1 – F(x)

Hazard function. The hazard function describes the probability of failure during a very small time increment, assuming that no failures have occurred prior to that time. The Weibull distribution has the hazard function (for positive parameters b, c, and ):

h(t) = f(t)/R(t) = [c*(x-)(c-1)] / bc

using the same notation and symbols as described above for the density and reliability functions.

Cumulative hazard function. The Weibull distribution has the cumulative hazard function (for positive parameters b, c, and ):

H(t) = (x-) / bc

using the same notation and symbols as described above for the density and reliability functions.



Data Mining & StatSoft Power Solutions

Analytics supported

by third party report

The non-profit Electric Power Research Institute (EPRI) recently conducted a study of the StatSoft technology to determine its suitability for optimizing the performance (heat-rate, emissions, LOI) in an older coal-fired power plant. EPRI ordered from StatSoft an optimization project to be conducted under scrutiny of their inspectors.

Using nine months worth of detailed 6-minute interval data describing more than 140 parameters of the process, EPRI found that process data analysis using STATISTICA is a cost-effective solution for optimizing the use of current process hardware to save cost and reduce emissions.

Overview of the Approach

StatSoft Power Solutions offer solution packages designed for utility companies, for optimizing power plant performance, increasing the efficiency, and reducing emissions. Based on over 20 years of experience in applying advanced data-driven, data mining optimization technologies for process optimization in various industries, these solutions will allow power plants to get the most out of their equipment and control systems, by leveraging all data collected at your site to identify opportunities for improvement, even for older designs such as coal-fired Cyclone furnaces (as well as wall-fired or T-fired designs).

Opportunities for Data Driven Strategies to Improve Powerplant Performance

Many (most) power generation facilities are collecting “lots of data” into dedicated historical process data bases. (such as OSI PI) However, in most cases, only simple charts and “after-the-fact” ad-hoc analyses are performed on a small subset of those data; most information is simply not used.

For example, for coal fired power plants, our solutions can help you identify optimum settings for stoichiometric ratio, primary/tertiary air flows, secondary air biases, distribution of OFA (overfired air), burner tilts and yaw positions, and other controllable parameters to reduce NOx, CO, and LOI, without requiring any re-engineering of existing hardware.

What is Data Mining? Why Data Mining?

Data Mining is the term used to describe the application of various machine learning and/or pattern recognition algorithms and techniques, to identify complex relationships among observed variables. These techniques can reveal invaluable insights when the data contain meaningful information which is “hidden” deep inside your data set, and cannot be identified with simple methods. Advanced data mining can reveal those insights by processing many variables and complex interrelations between them, all at the same time.

Unlike CFD, data mining allows you to model the “real world” from “real data,” describing your specific plant. Using this approach, you can:

  • Identify from among hundreds or even thousands of input parameters those that are critical for low-emissions efficient operations
  • Determine the ranges for those parameters, and combinations of parameter ranges that will result in robust and stable low-emissions operations, without costly excursions (high-emissions events, unscheduled maintenance and expensive generation roll-backs).

These results can be implemented using your existing closed-or-open loop control system to achieve sustained improvements in power plant performance, or you can use StatSoft MultiStream to create a state-of-the-art advanced process monitoring system to achieve permanent improvements.

How is this Different from “Neural Nets” for Closed Loop Control?

One frequently asked question is: How do these solutions differ from neural networks based computer programs that can control critical power plant operations in a closed loop system (an approach used at some plants, often with less than expected success)?

The answer is that, because those systems are based on relatively simple, traditional neural networks technology which typically can only simultaneously process relatively few parameters, they are not capable of identifying the important parameters from among hundreds of possible candidates, and they will not identify specific combinations of parameter ranges (“sweet spots”) that make overall power plant operations more robust.

The cutting-edge technologies developed by StatSoft Power Solutions will not simply implement a cookie-cutter approach to use a few parameters common to all power plants to achieve some (usually only very modest) overall process performance improvements. Instead, our approach allows you to take a fresh look at all your data and operations, to optimize them for best performance. This will allow you to focus your process monitoring efforts, operator training, or automation initiatives only on those parameters that actually drive boiler efficiency, emissions, and so on at your plant and for your equipment.

What we are offering is not simply another neural net for closed loop control; instead, it provides flexible tools based on cutting-edge data processing technologies to optimize all systems, and also provides smart monitoring and advisory options capable of predicting problems, such as emissions related to combustion optimization or maintenance issues.

Contact StatSoft Southern Africa for more information about our services, software solutions, and recent success stories.

Designing an Experiment, Power Analysis

The techniques of statistical power analysis, sample size estimation, and advanced techniques for confidence interval estimation are discussed here. The main goal of first the two techniques is to allow you to decide, while in the process of designing an experiment, (a) how large a sample is needed to enable statistical judgments that are accurate and reliable and (b) how likely your statistical test will be to detect effects of a given size in a particular situation. The third technique is useful in implementing objectives a and b and in evaluating the size of experimental effects in practice.

Performing power analysis and sample size estimation is an important aspect of experimental design, because without these calculations, sample size may be too high or too low. If sample size is too low, the experiment will lack the precision to provide reliable answers to the questions it is investigating. If sample size is too large, time and resources will be wasted, often for minimal gain.

In some power analysis software programs, a number of graphical and analytical tools are available to enable precise evaluation of the factors affecting power and sample size in many of the most commonly encountered statistical analyses. This information can be crucial to the design of a study that is cost-effective and scientifically useful.

Noncentrality interval estimationprocedures and other sophisticated confidence interval procedures provide some sophisticated confidence interval methods for analyzing the importance of an observed experimental result. An increasing number of influential statisticians are suggesting that confidence interval estimation should augment or replace traditional hypothesis testing approaches in the analysis of experimental data.


Power Analysis and Sample Size Calculation in Experimental Design

There is a growing recognition of the importance of power analysis and sample size calculation in the proper design of experiments. Click on the links below for a discussion of the fundamental ideas behind these methods.

Sampling Theory

In most situations in statistical analysis, we do not have access to an entire statistical population of interest, either because the population is too large, is not willing to be measured, or the measurement process is too expensive or time-consuming to allow more than a small segment of the population to be observed. As a result, we often make important decisions about a statistical population on the basis of a relatively small amount of sample data.

Typically, we take a sample and compute a quantity called a statistic in order to estimate some characteristic of a population called a parameter.

For example, suppose a politician is interested in the proportion of people who currently favor her position on a particular issue. Her constituency is a large city with a population of about 1,500,000 potential voters. In this case, the parameter of interest, which we might call , is the proportion of people in the entire population who favor the politician’s position. The politician is going to commission an opinion poll, in which a (hopefully) random sample of people will be asked whether or not they favor her position. The number (call it N) of people to be polled will be quite small, relative to the size of the population. Once these people have been polled, the proportion of them favoring the politician’s position will be computed. This proportion, which is a statistic, can be called p.

One thing is virtually certain before the study is ever performed: p will not be equal to ! Because p involves “the luck of the draw,” it will deviate from . The amount by which p is wrong, i.e., the amount by which it deviates from , is called sampling error.

In any one sample, it is virtually certain there will be some sampling error (except in some highly unusual circumstances), and that we will never be certain exactly how large this error is. If we knew the amount of the sampling error, this would imply that we also knew the exact value of the parameter, in which case we would not need to be doing the opinion poll in the first place.

In general, the larger the sample size N, the smaller sampling error tends to be. (You can never be sure what will happen in a particular experiment, of course.) If we are to make accurate decisions about a parameter like , we need to have an N large enough so that sampling error will tend to be “reasonably small.” If N is too small, there is not much point in gathering the data, because the results will tend to be too imprecise to be of much use.

On the other hand, there is also a point of diminishing returns beyond which increasing N provides little benefit. Once N is “large enough” to produce a reasonable level of accuracy, making it larger simply wastes time and money.

So some key decisions in planning any experiment are, “How precise will my parameter estimates tend to be if I select a particular sample size?” and “How big a sample do I need to attain a desirable level of precision?”

The purpose of Power Analysis and Sample Size Estimation is to provide you with the statistical methods to answer these questions quickly, easily, and accurately. A good statistical software program will provide simple dialogs for performing power calculations and sample size estimation for many of the classic statistical procedures as well as special noncentral distribution routines to allow the advanced user to perform a variety of additional calculations.

Hypothesis Testing Logic

Suppose that the politician was interested in showing that more than the majority of people supported her position. Her question, in statistical terms: “Is > .50?” Being an optimist, she believes that it is.

In statistics, the following strategy is quite common. State as a “statistical null hypothesis” something that is the logical opposite of what you believe. Call this hypothesis H0. Gather data. Then, using statistical theory, show from the data that it is likely H0 is false, and should be rejected.

By rejecting H0, you support what you actually believe. This kind of situation, which is typical in many fields of research, for example, is called “Reject-Support testing,” (RS testing) because rejecting the null hypothesis supports the experimenter’s theory.

The null hypothesis is either true or false, and the statistical decision process is set up so that there are no “ties.” The null hypothesis is either rejected or not rejected. Consequently, before undertaking the experiment, we can be certain that only 4 possible things can happen. These are summarized in the table below


State of the World
Decision H0 Correct
Type II Error
H1 Type I Error

Note that there are two kinds of errors represented in the table. Many statistics textbooks present a point of view that is common in the social sciences, i.e., that , the Type I error rate, must be kept at or below .05, and that, if at all possible, , the Type II error rate, must be kept low as well. “Statistical power,” which is equal to 1 – , must be kept correspondingly high. Ideally, power should be at least .80 to detect a reasonable departure from the null hypothesis.

The conventions are, of course, much more rigid with respect to than with respect to . For example, in the social sciences seldom, if ever, is allowed to stray above the magical .05 mark.

Significance Testing (RS/AS). In the context of significance testing, we can define two basic kinds of situations, reject-support (RS) (discussed above) and accept-support (AS). In RS testing, the null hypothesis is the opposite of what the researcher actually believes, and rejecting it supports the researcher’s theory. In a two group RS experiment involving comparison of the means of an experimental and control group, the experimenter believes the treatment has an effect, and seeks to confirm it through a significance test that rejects the null hypothesis.

In the RS situation, a Type I error represents, in a sense, a “false positive” for the researcher’s theory. From society’s standpoint, such false positives are particularly undesirable. They result in much wasted effort, especially when the false positive is interesting from a theoretical or political standpoint (or both), and as a result stimulates a substantial amount of research. Such follow-up research will usually not replicate the (incorrect) original work, and much confusion and frustration will result.

In RS testing, a Type II error is a tragedy from the researcher’s standpoint, because a theory that is true is, by mistake, not confirmed. So, for example, if a drug designed to improve a medical condition is found (incorrectly) not to produce an improvement relative to a control group, a worthwhile therapy will be lost, at least temporarily, and an experimenter’s worthwhile idea will be discounted.

As a consequence, in RS testing, society, in the person of journal editors and reviewers, insists on keeping low. The statistically well-informed researcher makes it a top priority to keep low as well. Ultimately, of course, everyone benefits if both error probabilities are kept low, but unfortunately there is often, in practice, a trade-off between the two types of error.

The RS situation is by far the more common one, and the conventions relevant to it have come to dominate popular views on statistical testing. As a result, the prevailing views on error rates are that relaxing beyond a certain level is unthinkable, and that it is up to the researcher to make sure statistical power is adequate. You might argue how appropriate these views are in the context of RS testing, but they are not altogether unreasonable.

In AS testing, the common view on error rates we described above is clearly inappropriate. In AS testing, H0 is what the researcher actually believes, so accepting it supports the researcher’s theory. In this case, a Type I error is a false negative for the researcher’s theory, and a Type II error constitutes a false positive. Consequently, acting in a way that might be construed as highly virtuous in the RS situation, for example, maintaining a very low Type I error rate like .001, is actually “stacking the deck” in favor of the researcher’s theory in AS testing.

In both AS and RS situations, it is easy to find examples where significance testing seems strained and unrealistic. Consider first the RS situation. In some such situations, it is simply not possible to have very large samples. An example that comes to mind is social or clinical psychological field research. Researchers in these fields sometimes spend several days interviewing a single subject. A year’s research may only yield valid data from 50 subjects. Correlational tests, in particular, have very low power when samples are that small. In such a case, it probably makes sense to relax beyond .05, if it means that reasonable power can be achieved.

On the other hand, it is possible, in an important sense, to have power that is too high. For example, you might be testing the hypothesis that two population means are equal (i.e., Mu1 = Mu2) with sample sizes of a million in each group. In this case, even with trivial differences between groups, the null hypothesis would virtually always be rejected.

The situation becomes even more unnatural in AS testing. Here, if N is too high, the researcher almost inevitably decides against the theory, even when it turns out, in an important sense, to be an excellent approximation to the data. It seems paradoxical indeed that in this context experimental precision seems to work against the researcher.

To summarize:

In Reject-Support research:

  1. The researcher wants to reject H0.
  2. Society wants to control Type I error.
  3. The researcher must be very concerned about Type II error.
  4. High sample size works for the researcher.
  5. If there is “too much power,” trivial effects become “highly significant.”

In Accept-Support research:

  1. The researcher wants to accept H0.
  2. “Society” should be worrying about controlling Type II error, although it sometimes gets confused and retains the conventions applicable to RS testing.
  3. The researcher must be very careful to control Type I error.
  4. High sample size works against the researcher.
  5. If there is “too much power,” the researcher’s theory can be “rejected” by a significance test even though it fits the data almost perfectly.

Calculating Power

Properly designed experiments must ensure that power will be reasonably high to detect reasonable departures from the null hypothesis. Otherwise, an experiment is hardly worth doing. Elementary textbooks contain detailed discussions of the factors influencing power in a statistical test. These include

  1. What kind of statistical test is being performed. Some statistical tests are inherently more powerful than others.
  2. Sample size. In general, the larger the sample size, the larger the power. However, generally increasing sample size involves tangible costs, both in time, money, and effort. Consequently, it is important to make sample size “large enough,” but not wastefully large.
  3. The size of experimental effects. If the null hypothesis is wrong by a substantial amount, power will be higher than if it is wrong by a small amount.
  4. The level of error in experimental measurements. Measurement error acts like “noise” that can bury the “signal” of real experimental effects. Consequently, anything that enhances the accuracy and consistency of measurement can increase statistical power.

Calculating Required Sample Size

To ensure a statistical test will have adequate power, you usually must perform special analyses prior to running the experiment, to calculate how large an N is required.

Let’s briefly examine the kind of statistical theory that lies at the foundation of the calculations used to estimate power and sample size. Return to the original example of the politician, contemplating how large an opinion poll should be taken to suit her purposes.

Statistical theory, of course, cannot tell us what will happen with any particular opinion poll. However, through the concept of a sampling distribution, it can tell us what will tend to happen in the long run, over many opinion polls of a particular size.

A sampling distribution is the distribution of a statistic over repeated samples. Consider the sample proportion p resulting from an opinion poll of size N, in the situation where the population proportion is exactly .50. Sampling distribution theory tells us that p will have a distribution that can be calculated from the binomial theorem. This distribution, for reasonably large N, and for values of p not too close to 0 or 1, looks very much like a normal distribution with a mean of and a standard deviation (called the “standard error of the proportion“) of

sp = (p(1-p)/N)**1/2

Suppose, for example, the politician takes an opinion poll based on an N of 100. Then the distribution of p, over repeated samples, will look like this if = .5.

The values are centered around .5, but a small percentage of values are greater than .6 or less than .4. This distribution of values reflects the fact that an opinion poll based on a sample of 100 is an imperfect indicator of the population proportion .

If p were a “perfect” estimate of , the standard error of the proportion would be zero, and the sampling distribution would be a spike located at 0.5. The spread of the sampling distribution indicates how much “noise” is mixed in with the “signal” generated by the parameter.

Notice from the equation for the standard error of the proportion that, as N increases, the standard error of the proportion gets smaller. If N becomes large enough, we can be very certain that our estimate p will be a very accurate one.

Suppose the politician uses a decision criterion as follows. If the observed value of p is greater than .58, she will decide that the null hypothesis that is less than or equal to .50 is false. This rejection rule is diagrammed below.

You may, by adding up all the probabilities (computable from the binomial distribution), determine that the probability of rejecting the null hypothesis when p = .50 is .044. Hence, this decision rule controls the Type I Error rate, , at or below .044. It turns out, this is the lowest decision criterion that maintains at or below .05.

However, the politician is also concerned about power in this situation, because it is by rejecting the null hypothesis that she is able to support the notion that she has public opinion on her side.

Suppose that 55% of the people support the politician, that is, that = .55 and the null hypothesis is actually false. In this case, the correct decision is to reject the null hypothesis. What is the probability that she will obtain a sample proportion greater than the “cut-off” value of .58 required to reject the null hypothesis?

In the figure below, we have superimposed the sampling distribution for p when = .55. Clearly, only a small percentage of the time will the politician reach the correct decision that she has majority support. The probability of obtaining a p greater than .58 is only .241.

Needless to say, there is no point in conducting an experiment in which, if your position is correct, it will only be verified 24.1% of the time! In this case a statistician would say that the significance test has “inadequate power to detect a departure of 5 percentage points from the null hypothesized value.”

The crux of the problem lies in the width of the two distributions in the preceding figure. If the sample size were larger, the standard error of the proportion would be smaller, and there would be little overlap between the distributions. Then it would be possible to find a decision criterion that provides a low and high power.

The question is, “How large an N is necessary to produce a power that is reasonably high” in this situation, while maintaining at a reasonably low value.

You could, of course, go through laborious, repetitive calculations in order to arrive at such a sample size. However, a good software program will perform them automatically, with just a few clicks of the mouse. Moreover, for each analytic situation that it handles, it will provide extensive capabilities for analyzing and graphing the theoretical relationships between power, sample size, and the variables that affect them. Assuming that the user will be employing the well known chi-square test, rather than the exact binomial test, suppose that the politician decides that she requires a power of .80 to detect a p of .80. It turns out, a sample size of 607 will yield a power of exactly .8009. (The actual alpha of this test, which has a nominal level of .05, is .0522 in this situation.)


Graphical Approaches to Power Analysis

In the preceding discussion, we arrived at a necessary sample size of 607 under the assumption that p is precisely .80. In practice, of course, we would be foolish to perform only one power calculation, based on one hypothetical value. For example, suppose the function relating required sample size to p is particularly steep in this case. It might then be that the sample size required for a p of .70 is much different than that required to reliably detect a p of .80.

Intelligent analysis of power and sample size requires the construction, and careful evaluation, of graphs relating power, sample size, the amount by which the null hypothesis is wrong (i.e., the experimental effect), and other factors such as Type I error rate.

In the example discussed in the preceding section, the goal, from the standpoint of the politician, is to plan a study that can decide, with a low probability of error, whether the support level is greater than .50. Graphical analysis can shed a considerable amount of light on the capabilities of a statistical test to provide the desired information under such circumstances.

For example, the researcher could plot power against sample size, under the assumption that the true level is .55, i.e., 55%. The user might start with a graph that covers a very wide range of sample sizes, to get a general idea of how the statistical test behaves. The following graph shows power as a function of sample sizes ranging from 20 to 2000, using a “normal approximation” to the exact binomial distribution.

The previous graph demonstrates that power reaches an acceptable level (often considered to be between .80 and .90) at a sample size of approximately 600.

Remember, however, that this calculation is based on the supposition that the true value of p is .55. It may be that the shape of the curve relating power and sample size is very sensitive to this value. The question immediately arises, “how sensitive is the slope of this graph to changes in the actual value of p?

There are a number of ways to address this question. You can plot power vs. sample size for other values of p, for example. Below is a graph of power vs. sample size for p = .6.

You can see immediately in the preceding graph that the improvement in power for increases in N occurs much more rapidly for p = .6 than for p = .55. The difference is striking if you merge the two graphs into one, as shown below.

In planning a study, particularly when a grant proposal must be submitted with a proposed sample size, you must estimate what constitutes a reasonable minimum effect that you wish to detect, a minimum power to detect that effect, and the sample size that will achieve that desired level of power. This sample size can be obtained by analyzing the above graphs (additionally, some software packages can calculate it directly). For example, if the user requests the minimum sample size required to achieve a power of .90 when p = .55, some programs can calculate this directly. The result is reported in a spreadsheet, as below,


One Proportion, Z
(or Chi-Square) Test
H0: Pi < = Pi0
Null Hypothesized Proportion (Pi0) .5000
Population Proportion (Pi) .5500
Alpha (Nominal) .0500
Required Power .9000
Required Sample Size (N) 853.0000
Actual Alpha (Exact) .0501
Power (Normal Approximation) .9001
Power (Exact) .9002

For a given level of power, a graph of sample size vs. p can show how sensitive the required sample size is to the actual value of p. This can be important in gauging how sensitive the estimate of a required sample size is. For example, the following graph shows values of N needed to achieve a power of .90 for various values of p, when the null hypothesis is that p = .50

The preceding graph demonstrates how the required N drops off rapidly as p varies from .55 to .60. To be able to reliably detect a difference of .05 (from the null hypothesized value of .50) requires an N greater than 800, but reliable detection of a difference of .10 requires an N of only around 200. Obviously, then, required sample size is somewhat difficult to pinpoint in this situation. It is much better to be aware of the overall performance of the statistical test against a range of possibilities before beginning an experiment, than to be informed of an unpleasant reality after the fact. For example, imagine that the experimenter had estimated the required sample size on the basis of reliably (with power of .90) detecting a p of .6. The experimenter budgets for a sample size of, say, 220, and imagines that minor departures of p from .6 will not require substantial differences in N. Only later does the experimenter realize that a small change in requires a huge increase in N , and that the planning for the experiment was optimistic. In some such situations, a “window of opportunity” may close before the sample size can be adjusted upward.

Across a wide variety of analytic situations, Power analysis and sample size estimation involve steps that are fundamentally the same.

  1. The type of analysis and null hypothesis are specified
  2. Power and required sample size for a reasonable range of effects is investigated.
  3. The sample size required to detect a reasonable experimental effect (i.e., departure from the null hypothesis), with a reasonable level of power, is calculated, while allowing for a reasonable margin of error.

Noncentrality Interval Estimation and the Evaluation of Statistical Models

Power Analysis and Interval Estimation includes a number of confidence intervals that are not widely available in general purpose statistics packages. Several of these are discussed within a common theoretical framework, called “noncentrality interval estimation,” by Steiger and Fouladi (1997). In this section, we briefly review some of the basic rationale behind the emerging popularity of confidence intervals.

Inadequacies of the Hypothesis Testing Approach

Strictly speaking, the outcome of a significance test is the dichotomous decision whether or not to reject the null hypothesis. This dichotomy is inherently dissatisfying to many scientists who use the null hypothesis as a statement of no effect, and are more interested in knowing how big an effect is than whether it is (precisely) zero. This has led to behavior like putting one, two, or three asterisks next to results in tables, or listing p-values next to results, when, in fact, such numbers, across (or sometimes even within!) studies need not be monotonically related to the best estimates of strength of experimental effects, and hence can be extremely misleading. Some writers (e.g., Guttman, 1977) view asterisk-placing behavior as inconsistent with the foundations of significance testing logic.

Probability levels can deceive about the “strength” of a result, especially when presented without supporting information. For example, if, in an ANOVA table, one effect had a p-value of .019, and the other a p-value of .048, it might be an error to conclude that the statistical evidence supported the view that the first effect was stronger than the second. A meaningful interpretation would require additional information. To see why, suppose someone reports a p-value of .001. This could be representative of a trivial population effect combined with a huge sample size, or a powerful population effect combined with a moderate sample size, or a huge population effect with a small sample. Similarly a p-value of .075 could represent a powerful effect operating with a small sample, or a tiny effect with a huge sample. Clearly then, we need to be careful when comparing p-values.

In Accept-Support testing, which occurs frequently in the context of model fitting in factor analysis or “causal modeling,” significance testing logic is basically inappropriate. Rejection of an “almost true” null hypothesis in such situations frequently has been followed by vague statements that the rejection shouldn’t be taken too seriously. Failure to reject a null hypothesis usually results in a demand by a vigilant journal editor for cumbersome power calculations. Such problems can be avoided to some extent by using confidence intervals.


Advantages of Interval Estimation

Much research is exploratory. The fundamental questions in exploratory research are “What is our best guess for the size of the population effect?” and “How precisely have we determined the population effect size from our sample data?” Significance testing fails to answer these questions directly. Many a researcher, faced with an “overwhelming rejection” of a null hypothesis, cannot resist the temptation to report that it was “significant well beyond the .001 level.” Yet it is widely agreed that a p-value following a significance test can be a poor vehicle for conveying what we have learned about the strength of population effects.

Confidence interval estimation provides a convenient alternative to significance testing in most situations. Consider the 2-tailed hypothesis of no difference between means. Recall first that the significance test rejects at the significance level if and only if the 1 – confidence interval for the mean difference excludes the value zero. Thus the significance test can be performed with the confidence interval. Most undergraduate texts in behavioral statistics show how to compute such a confidence interval. The interval is exact under the assumptions of the standard t test. However, the confidence interval contains information about experimental precision that is not available from the result of a significance test. Assuming we are reasonably confident about the metric of the data, it is much more informative to state a confidence interval on Mu1 – Mu2 than it is to give the p-value for the t test of the hypothesis that Mu1 – Mu2 = 0 In summary, we might say that, in general, a confidence interval conveys more information, in a more naturally usable form, than a significance test.

This is seen most clearly when confidence intervals from several studies are graphed alongside one another, as in the figure below

The figure shows confidence intervals for the difference between means for 3 experiments, all performed in the same domain, using measures with approximately the same variability. Experiments 1 and 3 yield a confidence interval that fails to include zero. For these experiments, the null hypothesis was rejected. The second experiment yields a confidence interval that includes zero, so the null hypothesis of no difference is not rejected. A significance testing approach would yield the impression that the second experiment did not agree with the first and the third.

The confidence intervals suggest a different interpretation, however. The first experiment had a very large sample size, and very high precision of measurement, reflected in a very narrow confidence interval. In this experiment, a small effect was found, and determined with such high precision that the null hypothesis of no difference could be rejected at a stringent significance level.

The second experiment clearly lacked precision, and this is reflected in the very wide confidence interval. Evidently, the sample size was too small. It may well be that the actual effect in conditions assessed in the second experiment was larger than that in the first experiment, but the experimental precision was simply inadequate to detect it.

The third experiment found an effect that was statistically significant, and perhaps substantially higher than the first experiment, although this is partly masked by the lower level of precision, reflected in a confidence interval that, though narrower than Experiment 2, is substantially wider than Experiment 1.

Suppose the 3 experiments involved testing groups for differences in IQ. In the final analysis, we may have had too much power in Experiment 1, as we are declaring “highly significant” a rather miniscule effect substantially less than a single IQ point. We had far too little power in Experiment 2. Experiment 3 seems about right.

Many of the arguments we have made on behalf of confidence intervals have been made by others as cogently as we have made them here. Yet, confidence intervals are seldom reported in the literature. Most important, as we demonstrate in the succeeding sections, there are several extremely useful confidence intervals that virtually never are reported. In what follows, we discuss why the intervals are seldom reported.


Why Interval Estimates are Seldom Reported

In spite of the obvious advantages of interval estimates, they are seldom employed in published articles in many areas of science. On those infrequent occasions when interval estimates are reported, they are often not the optimal ones. There are several reasons for this status quo:

Tradition.Traditional approaches to statistics emphasize significance testing much more than interval estimation.

Pragmatism. In RS situations, interval estimates are sometimes embarrassing. When they are narrow but close to zero, they suggest that a “highly significant” result may be statistically significant but trivial. When they are wide, they betray a lack of experimental precision.

Ignorance.Many people are simply unaware of some of the very valuable interval estimation procedures that are available. For example, many textbooks on multivariate analysis never mention that it is possible to compute a confidence interval on the squared multiple correlation coefficient.

Lack of availability. Some of the most desirable interval estimation procedures are computer intensive, and have not been implemented in major statistical packages. This has made it less likely that anyone will try the procedure.


Replacing Traditional Hypothesis Tests with Interval Estimates

There are a number of confidence interval procedures that can replace and/or augment the traditional hypothesis tests used in classical testing situations. For a review of these techniques, see Steiger & Fouladi (1997).

Analysis of Variance. One area where confidence intervals have seldom been employed is in assessing strength of effects in the Analysis of Variance (ANOVA).

For example, suppose you are reading a paper, which reports that, in a 1-Way ANOVA, with 4 groups, and N = 60per group, an F statistic was found that is significant at the .05 level (“F = 2.70, p =.0464″). This result is statistically significant, but how meaningful is it in a practical sense? What have we learned about the size of the experimental effects?

Fleischman (1980) discusses a technique for setting a confidence interval on the overall effect size in the Analysis of Variance. This technique allows you to set a confidence interval on the RMSSE, the root-mean-square standardized effect (RMSSE). Standardized effects are reported in standard deviation units, and are hence remain constant when the unit of measurement changes. So, for example, an experimental effect reported in pounds would be different from the same effect reported in kilograms, whereas the standardized effect would be the same in each case. In the case of the data mentioned above, the F statistic that is significant at the .05 level yields a 90% confidence interval for the RMSSE that ranges from .0190 to .3139. The lower limit of this interval stands for a truly mediocre effect, less than 1/50th of a standard deviation. The upper limit of the interval represents effects on the order of 1/3 of a standard deviation, moderate but not overwhelming. It seems, then, that the results from this study need not imply really strong experimental effects, even though the effects are statistically “significant.”

Multiple Regression. The squared multiple correlation is reported frequently as an index of the overall strength of a prediction equation. After fitting a regression equation, the most natural questions to ask are, (a) “How effective is the regression equation at predicting the criterion?” and (b) “How precisely has this effectiveness been determined?”

Hence, one very common statistical application that practically cries out for a confidence interval is multiple regression analysis. Publishing an observed squared multiple R together with the result of a hypothesis test that the population squared multiple correlation is zero, conveys little of the available statistical information. A confidence interval on the populations squared multiple correlation is much more informative.

One software package computes exact confidence intervals for the population squared multiple correlation, following the approach of Steiger and Fouladi (1992). As an example, suppose a criterion is predicted from 45 independent observations on 5 variables and the observed squared multiple correlation is .40. In this case a 95% confidence interval for the population squared multiple correlation ranges from .095 to .562! A 95% lower confidence limit is at .129. On the other hand the sample multiple correlation value is significant “beyond the .001 level,” because the p-value is .0009, and the shrunken estimator is .327. Clearly, it is far more impressive to state that “the squared multiple R value is significant at the .001 level” than it is to state that “we are 95% confident that the population squared multiple correlation is between .095 and .562.” But we believe the latter statement conveys the quality and meaning of the statistical result more accurately than the former.

Some writers, like Lee (1972), prefer a lower confidence limit, or “statistical lower bound” on the squared multiple correlation to a confidence interval. The rationale, apparently, is that we are primarily interested in assuring that the percentage of variance “accounted for” in the regression equation exceeds some value. Although we understand the motivation behind this view, we hesitate to accept it. The confidence interval, in fact, contains a lower bound, but also includes an upper bound, and, in the interval width, a measure of precision of estimation. It seems to us that adoption of a lower confidence limit can lead to a false sense of security, and reduces that amount of information available in the model assessment process.

Partial Least Squares (PLS)

This topic describes the use of partial least squares regression analysis. If you are unfamiliar with the basic methods of regression in linear models, it may be useful to first review this information in Elementary Concepts. The different designs discussed in this topic are also described in General Linear Models, Generalized Linear Models, and General Stepwise Regression.

Basic Ideas

Partial least squares regressionis an extension of the multiple linear regression model (see, e.g., Multiple Regression or General Stepwise Regression). In its simplest form, a linear model specifies the (linear) relationship between a dependent (response) variable Y, and a set of predictor variables, the X‘s, so that

Y = b0 + b1X1 + b2X2 + … + bpXp

In this equation b0 is the regression coefficient for the intercept and the bi values are the regression coefficients (for variables 1 through p) computed from the data.

So for example, you could estimate (i.e., predict) a person’s weight as a function of the person’s height and gender. You could use linear regression to estimate the respective regression coefficients from a sample of data, measuring height, weight, and observing the subjects’ gender. For many data analysis problems, estimates of the linear relationships between variables are adequate to describe the observed data, and to make reasonable predictions for new observations (see Multiple Regression or General Stepwise Regression for additional details).

The multiple linear regression model has been extended in a number of ways to address more sophisticated data analysis problems. The multiple linear regression model serves as the basis for a number of multivariate methods such as discriminant analysis (i.e., the prediction of group membership from the levels of continuous predictor variables), principal components regression (i.e., the prediction of responses on the dependent variables from factors underlying the levels of the predictor variables), and canonical correlation (i.e., the prediction of factors underlying responses on the dependent variables from factors underlying the levels of the predictor variables). These multivariate methods all have two important properties in common. These methods impose restrictions such that (1) factors underlying the Y and X variables are extracted from the Y’Y and X’X matrices, respectively, and never from cross-product matrices involving both the Y and X variables, and (2) the number of prediction functions can never exceed the minimum of the number of Y variables and X variables.

Partial least squares regression extends multiple linear regression without imposing the restrictions employed by discriminant analysis, principal components regression,and canonical correlation. In partial least squaresregression, prediction functions are represented by factors extracted from the Y’XX’Y matrix. The number of such prediction functions that can be extracted typically will exceed the maximum of the number of Y and X variables.

In short, partial least squares regression is probably the least restrictive of the various multivariate extensions of the multiple linear regression model. This flexibility allows it to be used in situations where the use of traditional multivariate methods is severely limited, such as when there are fewer observations than predictor variables. Furthermore, partial least squares regression can be used as an exploratory analysis tool to select suitable predictor variables and to identify outliers before classical linear regression.

Partial least squares regression has been used in various disciplines such as chemistry, economics, medicine, psychology, and pharmaceutical science where predictive linear modeling, especially with a large number of predictors, is necessary. Especially in chemometrics, partial least squares regression has become a standard tool for modeling linear relations between multivariate measurements (de Jong, 1993).


Computational Approach

Basic Model

As in multiple linear regression, the main purpose of partial least squares regression is to build a linear model, Y=XB+E, where Y is an n cases by m variables response matrix, X is an n cases by p variables predictor (design) matrix, B is a p by m regression coefficient matrix, and E is a noise term for the model which has the same dimensions as Y. Usually, the variables in X and Y are centered by subtracting their means and scaled by dividing by their standard deviations. For more information about centering and scaling in partial least squares regression, you can refer to Geladi and Kowalski(1986).

Both principal components regression and partial least squares regression produce factor scores as linear combinations of the original predictor variables, so that there is no correlation between the factor score variables used in the predictive regression model. For example, suppose we have a data set with response variables Y (in matrix form) and a large number of predictor variables X (in matrix form), some of which are highly correlated. A regression using factor extraction for this type of data computes the factor score matrix T=XW for an appropriate weight matrix W, and then considers the linear regression model Y=TQ+E, where Q is a matrix of regression coefficients (loadings) for T, and E is an error (noise) term. Once the loadings Q are computed, the above regression model is equivalent to Y=XB+E, where B=WQ, which can be used as a predictive regression model.

Principal components regression and partial least squares regression differ in the methods used in extracting factor scores. In short, principal components regression produces the weight matrix W reflecting the covariance structure between the predictor variables, while partial least squares regression produces the weight matrix W reflecting the covariance structure between the predictor and response variables.

For establishing the model, partial least squares regression produces a p by c weight matrix Wfor X such that T=XW, i.e., the columns of W are weight vectors for the X columns producing the corresponding n by c factor score matrix T. These weights are computed so that each of them maximizes the covariance between responses and the corresponding factor scores. Ordinary least squares procedures for the regression of Y on T are then performed to produce Q, the loadings for Y (or weights for Y) such that Y=TQ+E. Once Q is computed, we have Y=XB+E, where B=WQ, and the prediction model is complete.

One additional matrix necessary for a complete description of partial least squares regression procedures is the p by c factor loading matrix P which gives a factor model X=TP+F, where F is the unexplained part of the X scores. We now can describe the algorithms for computing partial least squares regression.

NIPALS Algorithm

The standard algorithm for computing partial least squares regression components (i.e., factors) is nonlinear iterative partial least squares (NIPALS). There are many variants of the NIPALS algorithm which normalize or do not normalize certain vectors. The following algorithm, which assumes that the Xand Y variables have been transformed to have means of zero, is considered to be one of most efficient NIPALS algorithms.

For each h=1,…,c, where A0=X’Y, M0=X’X, C0=I, and c given,

  1. compute qh, the dominant eigenvector of Ah‘Ah
  2. wh=ChAhqh, wh=wh/||wh||, and store wh into W as a column
  3. ph=Mhwh, ch=wh‘Mhwh, ph=ph/ch, and store ph into P as a column
  4. qh=Ah‘wh/ch, and store qh into Q as a column
  5. Ah+1=Ah – chphqh and Mh+1=Mh – chphph
  6. Ch+1=Ch – whph

The factor scores matrix T is then computed as T=XW and the partial least squares regression coefficients B of Y on X are computed as B=WQ.

SIMPLS Algorithm

An alternative estimation method for partial least squares regression components is the SIMPLS algorithm (de Jong, 1993), which can be described as follows.

For each h=1,…,c, where A0=X’Y, M0=X’X, C0=I, and c given,

  1. compute qh, the dominant eigenvector of Ah‘Ah
  2. wh=Ahqh, ch=wh‘Mhwh, wh=wh/sqrt(ch), and store wh into W as a column
  3. ph=Mhwh, and store ph into P as a column
  4. qh=Ah‘wh, and store qh into Q as a column
  5. vh=Chph, and vh=vh/||vh||
  6. Ch+1=Ch – vhvh and Mh+1=Mh – phph
  7. Ah+1=ChAh

Similarly to NIPALS, the T of SIMPLS is computed as T=XW and B for the regression of Y on X is computed as B=WQ’.


Training (Analysis) and Verification (Cross-Validation) Samples

A very important step when fitting models to be used for prediction of future observation is to verify (cross-validate) the results, i.e., to apply the current results to a new set of observations that was not used to compute those results (estimate the parameters). Some software programs offer very flexible methods for computing detailed predicted value and residual statistics for observations (1) that were not used in the computations for fitting the current model and have observed values for the dependent variables (the so-called cross-validation sample), and (2) that were not used in the computations for fitting the current model, and have missing data for the dependent variables (prediction sample).


Types of Analyses

The design for an analysis can include effects for continuous as well as categorical predictor variables. Designs may include polynomials for continuous predictors (e.g., squared or cubic terms) as well as interaction effects (i.e., product terms) for continuous predictors. For categorical predictor, you can fit ANOVA-like designs, including full factorial, nested, and fractional factorial designs, etc. Designs can be incomplete (i.e., involve missing cells), and effects for categorical predictor variables can be represented using either the sigma-restricted parameterization or the overparameterized (i.e., indicator variable) representation of effects.

The topics below give complete descriptions of the types of designs that can be analyzed using partial least squares regression, as well as types of designs that can be analyzed using the general linear model.

Between-Subject Designs

Overview. The levels or values of the predictor variables in an analysis describe the differences between the n subjects or the n valid cases that are analyzed. Thus, when we speak of the between subject design (or simply the between design) for an analysis, we are referring to the nature, number, and arrangement of the predictor variables.

Concerning the nature or type of predictor variables, between designs which contain only categorical predictor variables can be called ANOVA (analysis of variance) designs, between designs which contain only continuous predictor variables can be called regression designs, and between designs which contain both categorical and continuous predictor variables can be called ANCOVA (analysis of covariance) designs. Further, continuous predictors are always considered to have fixed values, but the levels of categorical predictors can be considered to be fixed or to vary randomly. Designs which contain random categorical factors are called mixed-model designs (see Variance Components and Mixed Model ANOVA/ANCOVA).

Between designs may involve only a single predictor variable and therefore be described as simple (e.g., simple regression) or may employ numerous predictor variables (e.g., multiple regression).

Concerning the arrangement of predictor variables, some between designs employ only “main effect” or first-order terms for predictors, that is, the values for different predictor variables are independent and raised only to the first power. Other between designs may employ higher-order terms for predictors by raising the values for the original predictor variables to a power greater than 1 (e.g., in polynomial regression designs), or by forming products of different predictor variables (i.e., interaction terms). A common arrangement for ANOVA designs is the full-factorial design, in which every combination of levels for each of the categorical predictor variables is represented in the design. Designs with some but not all combinations of levels for each of the categorical predictor variables are aptly called fractional factorial designs. Designs with a hierarchy of combinations of levels for the different categorical predictor variables are called nested designs.

These basic distinctions about the nature, number, and arrangement of predictor variables can be used in describing a variety of different types of between designs. Some of the more common between designs can now be described.

One-Way ANOVA. A design with a single categorical predictor variable is called a one-way ANOVA design. For example, a study of 4 different fertilizers used on different individual plants could be analyzed via one-way ANOVA, with four levels for the factor Fertilizer.

In general, consider a single categorical predictor variable A with 1 case in each of its 3 categories. Using the sigma-restricted coding of A into 2 quantitative contrast variables, the matrix X defining the between design is

That is, cases in groups A1, A2, and A3 are all assigned values of 1 on X0 (the intercept), the case in group A1 is assigned a value of 1 on X1 and a value 0 on X2, the case in group A2 is assigned a value of 0 on X1 and a value 1 on X2, and the case in group A3 is assigned a value of -1 on X1 and a value -1 on X2. Of course, any additional cases in any of the 3 groups would be coded similarly. If there were 1 case in group A1, 2 cases in group A2, and 1 case in group A3, the X matrix would be

where the first subscript for A gives the replicate number for the cases in each group. For brevity, replicates usually are not shown when describing ANOVA design matrices.

Note that in one-way designs with an equal number of cases in each group, sigma-restricted coding yields X1 … Xk variables all of which have means of 0.

Using the overparameterized model to represent A, the X matrix defining the between design is simply

These simple examples show that the X matrix actually serves two purposes. It specifies (1) the coding for the levels of the original predictor variables on the X variables used in the analysis as well as (2) the nature, number, and arrangement of the X variables, that is, the between design.

Main Effect ANOVA. Main effect ANOVA designs contain separate one-way ANOVA designs for 2 or more categorical predictors. A good example of main effect ANOVA would be the typical analysis performed on screening designs as described in Experimental Design.

Consider 2 categorical predictor variables A and B each with 2 categories. Using the sigma-restricted coding, the X matrix defining the between design is

Note that if there are equal numbers of cases in each group, the sum of the cross-products of values for the X1 and X2 columns is 0, for example, with 1 case in each group (1*1)+(1*-1)+(-1*1)+(-1*-1)=0. Using the overparameterized model, the matrix X defining the between design is

Comparing the two types of coding, it can be seen that the overparameterized coding takes almost twice as many values as the sigma-restricted coding to convey the same information.

Factorial ANOVA. Factorial ANOVA designs contain X variables representing combinations of the levels of 2 or more categorical predictors (e.g., a study of boys and girls in four age groups, resulting in a 2 (Gender) x 4 (Age Group) design). In particular, full-factorial designs represent all possible combinations of the levels of the categorical predictors. A full-factorial design with 2 categorical predictor variables A and B each with 2 levels each would be called a 2 x 2 full-factorial design. Using the sigma-restricted coding, the X matrix for this design would be

Several features of this X matrix deserve comment. Note that the X1 and X2 columns represent main effect contrasts for one variable, (i.e., A and B, respectively) collapsing across the levels of the other variable. The X3 column instead represents a contrast between different combinations of the levels of A and B. Note also that the values for X3 are products of the corresponding values for X1 and X2. Product variables such as X3 represent the multiplicative or interaction effects of their factors, so X3 would be said to represent the 2-way interaction of A and B. The relationship of such product variables to the dependent variables indicate the interactive influences of the factors on responses above and beyond their independent (i.e., main effect) influences on responses. Thus, factorial designs provide more information about the relationships between categorical predictor variables and responses on the dependent variables than is provided by corresponding one-way or main effect designs.

When many factors are being investigated, however, full-factorial designs sometimes require more data than reasonably can be collected to represent all possible combinations of levels of the factors, and high-order interactions between many factors can become difficult to interpret. With many factors, a useful alternative to the full-factorial design is the fractional factorial design. As an example, consider a 2 x 2 x 2 fractional factorial design to degree 2 with 3 categorical predictor variables each with 2 levels. The design would include the main effects for each variable, and all 2-way interactions between the three variables, but would not include the 3-way interaction between all three variables. Using the overparameterized model, the X matrix for this design is

The 2-way interactions are the highest degree effects included in the design. These types of designs are discussed in detail the 2**(k-p) Fractional Factorial Designs section of Experimental Design.

Nested ANOVA Designs. Nested designs are similar to fractional factorial designs in that all possible combinations of the levels of the categorical predictor variables are not represented in the design. In nested designs, however, the omitted effects are lower-order effects. Nested effects are effects in which the nested variables never appear as main effects. Suppose that for 2 variables A and B with 3 and 2 levels, respectively, the design includes the main effect for A and the effect of B nested within the levels of A. The X matrix for this design using the overparameterized model is

Note that if the sigma-restricted coding were used, there would be only 2 columns in the X matrix for the B nested within A effect instead of the 6 columns in the X matrix for this effect when the overparameterized model coding is used (i.e., columns X4 through X9). The sigma-restricted coding method is overly-restrictive for nested designs, so only the overparameterized model is used to represent nested designs.

Simple Regression. Simple regression designs involve a single continuous predictor variable. If there were 3 cases with values on a predictor variable P of, say, 7, 4, and 9, and the design is for the first-order effect of P, the X matrix would be

and using P for X1 the regression equation would be

Y = b0 + b1P

If the simple regression design is for a higher-order effect of P, say the quadratic effect, the values in the X1 column of the design matrix would be raised to the 2nd power, that is, squared

and using P2 for X1 the regression equation would be

Y = b0 + b1P2

The sigma-restricted and overparameterized coding methods do not apply to simple regression designs and any other design containing only continuous predictors (since there are no categorical predictors to code). Regardless of which coding method is chosen, values on the continuous predictor variables are raised to the desired power and used as the values for the X variables. No recoding is performed. It is therefore sufficient, in describing regression designs, to simply describe the regression equation without explicitly describing the design matrix X.

Multiple Regression. Multiple regression designs are to continuous predictor variables as main effect ANOVA designs are to categorical predictor variables, that is, multiple regression designs contain the separate simple regression designs for 2 or more continuous predictor variables. The regression equation for a multiple regression design for the first-order effects of 3 continuous predictor variables P, Q, and R would be

Y = b0 + b1P + b2Q + b3R

Factorial Regression. Factorial regression designs are similar to factorial ANOVA designs, in which combinations of the levels of the factors are represented in the design. In factorial regression designs, however, there may be many more such possible combinations of distinct levels for the continuous predictor variables than there are cases in the data set. To simplify matters, full-factorial regression designs are defined as designs in which all possible products of the continuous predictor variables are represented in the design. For example, the full-factorial regression design for two continuous predictor variables P and Q would include the main effects (i.e., the first-order effects) of P and Q and their 2-way P by Q interaction effect, which is represented by the product of P and Q scores for each case. The regression equation would be

Y = b0 + b1P + b2Q + b3P*Q

Factorial regression designs can also be fractional, that is, higher-order effects can be omitted from the design. A fractional factorial design to degree 2 for 3 continuous predictor variables P, Q, and R would include the main effects and all 2-way interactions between the predictor variables

Y = b0 + b1P + b2Q + b3R + b4P*Q + b5P*R + b6Q*R

Polynomial Regression. Polynomial regression designs are designs which contain main effects and higher-order effects for the continuous predictor variables but do not include interaction effects between predictor variables. For example, the polynomial regression design to degree 2 for three continuous predictor variables P, Q, and R would include the main effects (i.e., the first-order effects) of P, Q, and R and their quadratic (i.e., second-order)effects, but not the 2-way interaction effects or the P by Q by R 3-way interaction effect.

Y = b0 + b1P + b2P2 + b3Q + b4Q2 + b5R + b6R2

Polynomial regression designs do not have to contain all effects up to the same degree for every predictor variable. For example, main, quadratic, and cubic effects could be included in the design for some predictor variables, and effects up the fourth degree could be included in the design for other predictor variables.

Response Surface Regression. Quadratic response surface regression designs are a hybrid type of design with characteristics of both polynomial regression designs and fractional factorial regression designs. Quadratic response surface regression designs contain all the same effects of polynomial regression designs to degree 2 and additionally the 2-way interaction effects of the predictor variables. The regression equation for a quadratic response surface regression design for 3 continuous predictor variables P, Q, and R would be

Y = b0 + b1P + b2P2 + b3Q + b4Q2 + b5R + b6R2 + b7P*Q + b8P*R + b9Q*R

These types of designs are commonly employed in applied research (e.g., in industrial experimentation), and a detailed discussion of these types of designs is also presented in Experimental Design (see Central composite designs).

Analysis of Covariance. In general, between designs which contain both categorical and continuous predictor variables can be called ANCOVA designs. Traditionally, however, ANCOVA designs have referred more specifically to designs in which the first-order effects of one or more continuous predictor variables are taken into account when assessing the effects of one or more categorical predictor variables. A basic introduction to analysis of covariance can also be found in the Analysis of covariance (ANCOVA) section of ANOVA/MANOVA .

To illustrate, suppose a researcher wants to assess the influences of a categorical predictor variable A with 3 levels on some outcome, and that measurements on a continuous predictor variable P, known to covary with the outcome, are available. If the data for the analysis are

then the sigma-restricted X matrix for the design that includes the separate first-order effects of P and A would be

The b2 and b3 coefficients in the regression equation

Y = b0 + b1X1 + b2X2 + b3X3

represent the influences of group membership on the A categorical predictor variable, controlling for the influence of scores on the P continuous predictor variable. Similarly, the b1 coefficient represents the influence of scores on P controlling for the influences of group membership on A. This traditional ANCOVA analysis gives a more sensitive test of the influence of A to the extent that P reduces the prediction error, that is, the residuals for the outcome variable.

The X matrix for the same design using the overparameterized model would be

The interpretation is unchanged except that the influences of group membership on the A categorical predictor variables are represented by the b2, b3 and b4 coefficients in the regression equation

Y = b0 + b1X1 + b2X2 + b3X3 + b4X4

Separate Slope Designs. The traditional analysis of covariance (ANCOVA) design for categorical and continuous predictor variables is inappropriate when the categorical and continuous predictors interact in influencing responses on the outcome. The appropriate design for modeling the influences of the predictors in this situation is called the separate slope design. For the same example data used to illustrate traditional ANCOVA, the overparameterized X matrix for the design that includes the main effect of the three-level categorical predictor A and the 2-way interaction of P by A would be

The b4, b5, and b6 coefficients in the regression equation

Y = b0 + b1X1 + b2X2 + b3X3 + b4X4 + b5X5 + b6X6

give the separate slopes for the regression of the outcome on P within each group on A, controlling for the main effect of A.

As with nested ANOVA designs, the sigma-restricted coding of effects for separate slope designs is overly restrictive, so only the overparameterized model is used to represent separate slope designs. In fact, separate slope designs are identical in form to nested ANOVA designs, since the main effects for continuous predictors are omitted in separate slope designs.

Homogeneity of Slopes. The appropriate design for modeling the influences of continuous and categorical predictor variables depends on whether the continuous and categorical predictors interact in influencing the outcome. The traditional analysis of covariance (ANCOVA) design for continuous and categorical predictor variables is appropriate when the continuous and categorical predictors do not interact in influencing responses on the outcome, and the separate slope design is appropriate when the continuous and categorical predictors do interact in influencing responses. The homogeneity of slopes designs can be used to test whether the continuous and categorical predictors interact in influencing responses, and thus, whether the traditional ANCOVA design or the separate slope design is appropriate for modeling the effects of the predictors. For the same example data used to illustrate the traditional ANCOVA and separate slope designs, the overparameterized X matrix for the design that includes the main effect of P, the main effect of the three-level categorical predictor A, and the 2-way interaction of P by A would be

If the b5, b6, or b7 coefficient in the regression equation

Y = b0 + b1X1 + b2X2 + b3X3 + b4X4 + b5X5 + b6X6 + b7X7

is non-zero, the separate slope model should be used. If instead all 3 of these regression coefficients are zero the traditional ANCOVA design should be used.

The sigma-restricted X matrix for the homogeneity of slopes design would be

Using this X matrix, if the b4, or b5 coefficient in the regression equation

Y = b0 + b1X1 + b2X2 + b3X3 + b4X4 + b5X5

is non-zero, the separate slope model should be used. If instead both of these regression coefficients are zero the traditional ANCOVA design should be used.

Distance Graphs

A graphic technique that is useful in analyzing Partial Least Squares designs is a distance graph. These graphs allow you to compute and plot distances from the origin (zero for all dimensions) for the predicted and residual statistics, loadings, and weights for the respective number of components.

Based on Euclidean distances, these observation plots can be helpful in determining major contributors to the prediction of the conceptual variable(s) (plotting weights) as well as outliers that have a disproportionate influence (relative to the other observation) on the results (plotting residual values).

How to Analyze Data with Low Quality or Small Samples, Nonparametric Statistics

Brief review of the idea of significance testing. To understand the idea of nonparametric statistics (the term nonparametric was first used by Wolfowitz, 1942) first requires a basic understanding of parametric statistics. Elementary Concepts introduces the concept of statistical significance testing based on the sampling distribution of a particular statistic (you may want to review that topic before reading on). In short, if we have a basic knowledge of the underlying distribution of a variable, then we can make predictions about how, in repeated samples of equal size, this particular statistic will “behave,” that is, how it is distributed. For example, if we draw 100 random samples of 100 adults each from the general population, and compute the mean height in each sample, then the distribution of the standardized means across samples will likely approximate the normal distribution (to be precise, Student’s t distribution with 99 degrees of freedom; see below). Now imagine that we take an additional sample in a particular city (“Tallburg”) where we suspect that people are taller than the average population. If the mean height in that sample falls outside the upper 95% tail area of the t distribution then we conclude that, indeed, the people of Tallburg are taller than the average population.

Are most variables normally distributed? In the above example we relied on our knowledge that, in repeated samples of equal size, the standardized means (for height) will be distributed following the t distribution (with a particular mean and variance). However, this will only be true if in the population the variable of interest (height in our example) is normally distributed, that is, if the distribution of people of particular heights follows the normal distribution (the bell-shape distribution).

For many variables of interest, we simply do not know for sure that this is the case. For example, is income distributed normally in the population? — probably not. The incidence rates of rare diseases are not normally distributed in the population, the number of car accidents is also not normally distributed, and neither are very many other variables in which a researcher might be interested.

For more information on the normal distribution, see Elementary Concepts; for information on tests of normality, see Normality tests.

Sample size. Another factor that often limits the applicability of tests based on the assumption that the sampling distribution is normal is the size of the sample of data available for the analysis (sample size; n). We can assume that the sampling distribution is normal even if we are not sure that the distribution of the variable in the population is normal, as long as our sample is large enough (e.g., 100 or more observations). However, if our sample is very small, then those tests can be used only if we are sure that the variable is normally distributed, and there is no way to test this assumption if the sample is small.

Problems in measurement. Applications of tests that are based on the normality assumptions are further limited by a lack of precise measurement. For example, let us consider a study where grade point average (GPA) is measured as the major variable of interest. Is an A average twice as good as a C average? Is the difference between a B and an A average comparable to the difference between a D and a C average? Somehow, the GPA is a crude measure of scholastic accomplishments that only allows us to establish a rank ordering of students from “good” students to “poor” students. This general measurement issue is usually discussed in statistics textbooks in terms of types of measurement or scale of measurement. Without going into too much detail, most common statistical techniques such as analysis of variance (and t– tests), regression, etc., assume that the underlying measurements are at least of interval, meaning that equally spaced intervals on the scale can be compared in a meaningful manner (e.g, B minus A is equal to D minus C). However, as in our example, this assumption is very often not tenable, and the data rather represent a rank ordering of observations (ordinal) rather than precise measurements.

Parametric and nonparametric methods. Hopefully, after this somewhat lengthy introduction, the need is evident for statistical procedures that enable us to process data of “low quality,” from small samples, on variables about which nothing is known (concerning their distribution). Specifically, nonparametric methods were developed to be used in cases when the researcher knows nothing about the parameters of the variable of interest in the population (hence the name nonparametric). In more technical terms, nonparametric methods do not rely on the estimation of parameters (such as the mean or the standard deviation) describing the distribution of the variable of interest in the population. Therefore, these methods are also sometimes (and more appropriately) called parameter-free methods or distribution-free methods.

Brief Overview of Nonparametric Methods

Basically, there is at least one nonparametric equivalent for each parametric general type of test. In general, these tests fall into the following categories:

  • Tests of differences between groups (independent samples);
  • Tests of differences between variables (dependent samples);
  • Tests of relationships between variables.

Differences between independent groups. Usually, when we have two samples that we want to compare concerning their mean value for some variable of interest, we would use the t-test for independent samples); nonparametric alternatives for this test are the Wald-Wolfowitz runs test, the Mann-Whitney U test, and the Kolmogorov-Smirnov two-sample test. If we have multiple groups, we would use analysis of variance (see ANOVA/MANOVA; the nonparametric equivalents to this method are the Kruskal-Wallis analysis of ranks and the Median test.

Differences between dependent groups. If we want to compare two variables measured in the same sample we would customarily use the t-test for dependent samples (in Basic Statistics for example, if we wanted to compare students’ math skills at the beginning of the semester with their skills at the end of the semester). Nonparametric alternatives to this test are the Sign test and Wilcoxon’s matched pairs test. If the variables of interest are dichotomous in nature (i.e., “pass” vs. “no pass”) then McNemar’s Chi-square test is appropriate. If there are more than two variables that were measured in the same sample, then we would customarily use repeated measures ANOVA. Nonparametric alternatives to this method are Friedman’s two-way analysis of variance and Cochran Q test (if the variable was measured in terms of categories, e.g., “passed” vs. “failed”). Cochran Q is particularly useful for measuring changes in frequencies (proportions) across time.

Relationships between variables. To express a relationship between two variables one usually computes the correlation coefficient. Nonparametric equivalents to the standard correlation coefficient are Spearman R, Kendall Tau, and coefficient Gamma (see Nonparametric correlations). If the two variables of interest are categorical in nature (e.g., “passed” vs. “failed” by “male” vs. “female”) appropriate nonparametric statistics for testing the relationship between the two variables are the Chi-square test, the Phi coefficient, and the Fisher exact test. In addition, a simultaneous test for relationships between multiple cases is available: Kendall coefficient of concordance. This test is often used for expressing inter-rater agreement among independent judges who are rating (ranking) the same stimuli.

Descriptive statistics. When one’s data are not normally distributed, and the measurements at best contain rank order information, then computing the standard descriptive statistics (e.g., mean, standard deviation) is sometimes not the most informative way to summarize the data. For example, in the area of psychometrics it is well known that the rated intensity of a stimulus (e.g., perceived brightness of a light) is often a logarithmic function of the actual intensity of the stimulus (brightness as measured in objective units of Lux). In this example, the simple mean rating (sum of ratings divided by the number of stimuli) is not an adequate summary of the average actual intensity of the stimuli. (In this example, one would probably rather compute the geometric mean.) Nonparametrics and Distributions will compute a wide variety of measures of location (mean, median, mode, etc.) and dispersion (variance, average deviation, quartile range, etc.) to provide the “complete picture” of one’s data.



When to Use Which Method

It is not easy to give simple advice concerning the use of nonparametric procedures. Each nonparametric procedure has its peculiar sensitivities and blind spots. For example, the Kolmogorov-Smirnov two-sample test is not only sensitive to differences in the location of distributions (for example, differences in means) but is also greatly affected by differences in their shapes. The Wilcoxon matched pairs test assumes that one can rank order the magnitude of differences in matched observations in a meaningful manner. If this is not the case, one should rather use the Sign test. In general, if the result of a study is important (e.g., does a very expensive and painful drug therapy help people get better?), then it is always advisable to run different nonparametric tests; should discrepancies in the results occur contingent upon which test is used, one should try to understand why some tests give different results. On the other hand, nonparametric statistics are less statistically powerful (sensitive) than their parametric counterparts, and if it is important to detect even small effects (e.g., is this food additive harmful to people?) one should be very careful in the choice of a test statistic.

Large data sets and nonparametric methods. Nonparametric methods are most appropriate when the sample sizes are small. When the data set is large (e.g., n > 100) it often makes little sense to use nonparametric statistics at all. Elementary Concepts briefly discusses the idea of the central limit theorem. In a nutshell, when the samples become very large, then the sample means will follow the normal distribution even if the respective variable is not normally distributed in the population, or is not measured very well. Thus, parametric methods, which are usually much more sensitive (i.e., have more statistical power) are in most cases appropriate for large samples. However, the tests of significance of many of the nonparametric statistics described here are based on asymptotic (large sample) theory; therefore, meaningful tests can often not be performed if the sample sizes become too small. Please refer to the descriptions of the specific tests to learn more about their power and efficiency.

Nonparametric Correlations

The following are three types of commonly used nonparametric correlation coefficients (Spearman R, Kendall Tau, and Gamma coefficients). Note that the chi-square statistic computed for two-way frequency tables, also provides a careful measure of a relation between the two (tabulated) variables, and unlike the correlation measures listed below, it can be used for variables that are measured on a simple nominal scale.

Spearman R. Spearman R (Siegel & Castellan, 1988) assumes that the variables under consideration were measured on at least an ordinal (rank order) scale, that is, that the individual observations can be ranked into two ordered series. Spearman R can be thought of as the regular Pearson product moment correlation coefficient, that is, in terms of proportion of variability accounted for, except that Spearman R is computed from ranks.

Kendall tau. Kendall tau is equivalent to Spearman R with regard to the underlying assumptions. It is also comparable in terms of its statistical power. However, Spearman R and Kendall tau are usually not identical in magnitude because their underlying logic as well as their computational formulas are very different. Siegel and Castellan (1988) express the relationship of the two measures in terms of the inequality: More importantly, Kendall tau and Spearman R imply different interpretations: Spearman R can be thought of as the regular Pearson product moment correlation coefficient, that is, in terms of proportion of variability accounted for, except that Spearman R is computed from ranks. Kendall tau, on the other hand, represents a probability, that is, it is the difference between the probability that in the observed data the two variables are in the same order versus the probability that the two variables are in different orders.

-1 £ 3 * Kendall tau – 2 * Spearman R £ 1

Gamma. The Gamma statistic (Siegel & Castellan, 1988) is preferable to Spearman R or Kendall tau when the data contain many tied observations. In terms of the underlying assumptions, Gamma is equivalent to Spearman R or Kendall tau; in terms of its interpretation and computation it is more similar to Kendall tau than Spearman R. In short, Gamma is also a probability; specifically, it is computed as the difference between the probability that the rank ordering of the two variables agree minus the probability that they disagree, divided by 1 minus the probability of ties. Thus, Gamma is basically equivalent to Kendall tau, except that ties are explicitly taken into account.

How to Calculate the Relationship between Independent Variables and a Dependent Variable, Nonlinear Estimation

General Purpose

In the most general terms, Nonlinear Estimation will compute the relationship between a set of independent variables and a dependent variable. For example, we may want to compute the relationship between the dose of a drug and its effectiveness, the relationship between training and subsequent performance on a task, the relationship between the price of a house and the time it takes to sell it, etc. You may recognize research issues in these examples that are commonly addressed by such techniques as multiple regression (see Multiple Regression) or analysis of variance (see ANOVA/MANOVA). In fact, you may think of Nonlinear Estimation as a generalization of those methods. Specifically, multiple regression (and ANOVA) assumes that the relationship between the independent variable(s) and the dependent variable is linear in nature. Nonlinear Estimation leaves it up to you to specify the nature of the relationship; for example, you may specify the dependent variable to be a logarithmic function of the independent variable(s), an exponential function, a function of some complex ratio of independent measures, etc. (However, if all variables of interest are categorical in nature, or can be converted into categorical variables, you may also consider Correspondence Analysis.)

When allowing for any type of relationship between the independent variables and the dependent variable, two issues raise their heads. First, what types of relationships “make sense”, that is, are interpretable in a meaningful manner? Note that the simple linear relationship is very convenient in that it allows us to make such straightforward interpretations as “the more of x (e.g., the higher the price of a house), the more there is of y (the longer it takes to sell it); and given a particular increase in x, a proportional increase in y can be expected.” Nonlinear relationships cannot usually be interpreted and verbalized in such a simple manner. The second issue that needs to be addressed is how to exactly compute the relationship, that is, how to arrive at results that allow us to say whether or not there is a nonlinear relationship as predicted.

Let us now discuss the nonlinear regression problem in a somewhat more formal manner, that is, introduce the common terminology that will allow us to examine the nature of these techniques more closely, and how they are used to address important questions in various research domains (medicine, social sciences, physics, chemistry, pharmacology, engineering, etc.).

Estimating Linear and Nonlinear Models

Technically speaking, Nonlinear Estimation is a general fitting procedure that will estimate any kind of relationship between a dependent (or response variable), and a list of independent variables. In general, all regression models may be stated as:

y = F(x1, x2, … , xn)

In most general terms, we are interested in whether and how a dependent variable is related to a list of independent variables; the term F(x…) in the expression above means that y, the dependent or response variable, is a function of the x‘s, that is, the independent variables.

An example of this type of model would be the linear multiple regression model as described in Multiple Regression. For this model, we assume the dependent variable to be a linear function of the independent variables, that is:

y = a + b1*x1 + b2*x2 + … + bn*xn

If you are not familiar with multiple linear regression, you may want to read the introductory section to Multiple Regression at this point (however, it is not necessary to understand all of the nuances of multiple linear regression techniques in order to understand the methods discussed here).

Nonlinear Estimation allows you to specify essentially any type of continuous or discontinuous regression model. Some of the most common nonlinear models are probit, logit, exponential growth, and breakpoint regression. However, you can also define any type of regression equation to fit to your data. Moreover, you can specify either standard least squares estimation, maximum likelihood estimation (where appropriate), or, again, define your own “loss function” (see below) by defining the respective equation.

In general, whenever the simple linear regression model does not appear to adequately represent the relationships between variables, then the nonlinear regression model approach is appropriate. See the following topics for overviews of the common nonlinear regression models, nonlinear estimation procedures, and evaluation of the fit of the data to the nonlinear model.

Common Nonlinear Regression Models

Intrinsically Linear Regression Models

Polynomial Regression. A common “nonlinear” model is polynomial regression. We put the term nonlinear in quotes here because the nature of this model is actually linear. For example, suppose we measure in a learning experiment subjects’ physiological arousal and their performance on a complex tracking task. Based on the well-known Yerkes-Dodson law we could expect a curvilinear relationship between arousal and performance; this expectation can be expressed in the regression equation:

Performance = a + b1*Arousal + b2*Arousal2

In this equation, a represents the intercept, and b1 and b2 are regression coefficients. The non-linearity of this model is expressed in the term Arousal2. However, the nature of the model is still linear, except that when estimating it, we would square the measure of arousal. These types of models, where we include some transformation of the independent variables in a linear equation, are also referred to as models that are nonlinear in the variables.

Models that are nonlinear in the parameters. To contrast the example above, consider the relationship between a human’s age from birth (the x variable) and his or her growth rate (the y variable). Clearly, the relationship between these two variables in the first year of a person’s life (when most growth occurs) is very different than during adulthood (when almost no growth occurs). Thus, the relationship could probably best be expressed in terms of some negative exponential function:

Growth = exp(-b1*Age)

If you plotted this relationship for a particular estimate of the regression coefficient you would obtain a curve that looks something like this.

Note that the nature of this model is no longer linear, that is, the expression shown above does not simply represent a linear regression model, with some transformation of the independent variable. This type of model is said to be nonlinear in the parameters.

Making nonlinear models linear. In general, whenever a regression model can be “made” into a linear model, this is the preferred route to pursue (for estimating the respective model). The linear multiple regression model (see Multiple Regression) is very well understood mathematically, and, from a pragmatic standpoint, is most easily interpreted. Therefore, returning to the simple exponential regression model of Growth as a function of Age shown above, we could convert this nonlinear regression equation into a linear one by simply taking the logarithm of both sides of the equations, so that:

log(Growth) = -b1*Age

If we now substitute log(Growth) with y, we have the standard linear regression model as shown earlier (without the intercept which was ignored here to simplify matters). Thus, we could log-transform the Growth rate data and then use Multiple Regression to estimate the relationship between Age and Growth, that is, compute the regression coefficient b1.

Model adequacy. Of course, by using the “wrong” transformation, you could end up with an inadequate model. Therefore, after “linearizing” a model such as the one shown above, it is particularly important to use extensive residual statistics in Multiple Regression.

Intrinsically Nonlinear Regression Models

Some regression models which cannot be transformed into linear ones, can only be estimated via Nonlinear Estimation. In the growth rate example above, we purposely “forgot” about the random error in the dependent variable. Of course, the growth rate is affected by very many other variables (other than time), and we can expect a considerable amount of random (residual) fluctuation around the fitted line. If we add this error or residual variability to the model, we could rewrite it as follows:

Growth = exp(-b1*Age) + error

Additive error. In this model we assume that the error variability is independent of age, that is, that the amount of residual error variability is the same at any age. Because the error term in this model is additive, you can no longer linearize this model by taking the logarithm of both sides. If for a given data set, you were to log-transform variable Growth anyway and fit the simple linear model, then you would find that the residuals from the analysis would no longer be evenly distributed over the range of variable Age; and thus, the standard linear regression analysis (via Multiple Regression) would no longer be appropriate. Therefore, the only way to estimate the parameters for this model is via Nonlinear Estimation.

Multiplicative error. To “defend” our previous example, in this particular instance it is not likely that the error variability is constant at all ages, that is, that the error is additive. Most likely, there is more random and unpredictable fluctuation of the growth rate at the earlier ages than the later ages, when growth comes to a virtual standstill anyway. Thus, a more realistic model including the error would be:

Growth = exp(-b1*Age) * error

Put in words, the greater the age, the smaller the term exp(-b1*Age), and, consequently, the smaller the resultant error variability. If we now take the log of both sides of the equation, the residual error term will become an additive factor in a linear equation, and we can go ahead and estimate b1 via standard multiple regression.

Log (Growth) = -b1*Age + error

Let us now consider some regression models (that are nonlinear in their parameters) which cannot be “made into” linear models through simple transformations of the raw data.

General Growth Model. The general growth model, is similar to the example that we previously considered:

y = b0 + b1*exp(b2*x) + error

This model is commonly used in studies of any kind of growth (y), when the rate of growth at any given point in time (x) is proportional to the amount of growth remaining. The parameter b0 in this model represents the maximum growth value. A typical example where this model would be adequate is when you want to describe the concentration of a substance (e.g., in water) as a function of elapsed time.

Models for Binary Responses: Probit & Logit. It is not uncommon that a dependent or response variable is binary in nature, that is, that it can have only two possible values. For example, patients either do or do not recover from an injury; job applicants either succeed or fail at an employment test, subscribers to a journal either do or do not renew a subscription, coupons may or may not be returned, etc. In all of these cases, you may be interested in estimating a model that describes the relationship between one or more continuous independent variable(s) to the binary dependent variable.

Using linear regression. Of course, you could use standard multiple regression procedures to compute standard regression coefficients. For example, if you studied the renewal of journal subscriptions, you could create a y variable with 1‘s and 0‘s, where 1 indicates that the respective subscriber renewed, and 0 indicates that the subscriber did not renew. However, there is a problem: Multiple Regression does not “know” that the response variable is binary in nature. Therefore, it will inevitably fit a model that leads to predicted values that are greater than 1 or less than 0. However, predicted values that are greater than 1 or less than 0 are not valid; thus, the restriction in the range of the binary variable (e.g., between 0 and 1) is ignored if you use the standard multiple regression procedure.

Continuous response functions. We could rephrase the regression problem so that, rather than predicting a binary variable, we are predicting a continuous variable that naturally stays within the 0-1 bounds. The two most common regression models that accomplish exactly this are the logit and the probit regression models.

Logit regression. In the logit regression model, the predicted values for the dependent variable will never be less than (or equal to) 0, or greater than (or equal to) 1, regardless of the values of the independent variables. This is accomplished by applying the following regression equation, which actually has some “deeper meaning” as we will see shortly (the term logit was first used by Berkson, 1944):

y = exp(b0 + b1*x1 + … + bn*xn)/{1 + exp(b0 + b1*x1 + … + bn*xn)}

You can easily recognize that, regardless of the regression coefficients or the magnitude of the x values, this model will always produce predicted values (predicted y‘s) in the range of 0 to 1.

The name logit stems from the fact that you can easily linearize this model via the logit transformation. Suppose we think of the binary dependent variable y in terms of an underlying continuous probability p, ranging from 0 to 1. We can then transform that probability p as:

p’ = loge{p/(1-p)}

This transformation is referred to as the logit or logistic transformation. Note that p’ can theoretically assume any value between minus and plus infinity. Since the logit transform solves the issue of the 0/1 boundaries for the original dependent variable (probability), we could use those (logit transformed) values in an ordinary linear regression equation. In fact, if we perform the logit transform on both sides of the logit regression equation stated earlier, we obtain the standard linear regression model:

p’ = b0 + b1*x1 + b2*x2 + … + bn*xn

Probit regression. You may consider the binary response variable to be the result of a normally distributed underlying variable that actually ranges from minus infinity to positive infinity. For example, a subscriber to a journal can feel very strongly about not renewing a subscription, be almost undecided, “tend towards” renewing the subscription, or feel very much in favor of renewing the subscription. In any event, all that we (the publisher of the journal) will see is the binary response of renewal or failure to renew the subscription. However, if we set up the standard linear regression equation based on the underlying “feeling” or attitude we could write:

feeling… = b0 + b1*x1 + …

which is, of course, the standard regression model. It is reasonable to assume that these feelings are normally distributed, and that the probability p of renewing the subscription is about equal to the relative space under the normal curve. Therefore, if we transform each side of the equation so as to reflect normal probabilities, we obtain:

NP(feeling…) = NP(b0 + b1*x1 + …)

where NP stands for normal probability (space under the normal curve), as tabulated in practically all statistics texts. The equation shown above is also referred to as the probit regression model. (The term probit was first used by Bliss, 1934.)

General Logistic Regression Model. The general logistic model can be stated as:

y = b0/{1 + b1*exp(b2*x)}

You can think of this model as an extension of the logit or logistic model for binary responses. However, while the logit model restricts the dependent response variable to only two values, this model allows the response to vary within a particular lower and upper limit. For example, suppose we are interested in the population growth of a species that is introduced to a new habitat, as a function of time. The dependent variable would be the number of individuals of that species in the respective habitat. Obviously, there is a lower limit on the dependent variable, since fewer than 0 individuals cannot exist in the habitat; however, there also is most likely an upper limit that will be reached at some point in time.

Drug Responsiveness and Half-Maximal Response. In pharmacology, the following model is often used to describe the effects of different dose levels of a drug:

y = b0 – b0/{1 + (x/b2)b1}

In this model, x is the dose level (usually in some coded form, so that x 1) and y is the responsiveness, in terms of the percent of maximum possible responsiveness. The parameter b0 then denotes the expected response at the level of dose saturation and b2 is the concentration that produces a half- maximal response; the parameter b1 determines the slope of the function.

Discontinuous Regression Models
Piecewise linear regression. It is not uncommon that the nature of the relationship between one or more independent variables and a dependent variable changes over the range of the independent variables. For example, suppose we monitor the per-unit manufacturing cost of a particular product as a function of the number of units manufactured (output) per month. In general, the more units per month we produce, the lower is our per-unit cost, and this linear relationship may hold over a wide range of different levels of production output. However, it is conceivable that above a certain point, there is a discontinuity in the relationship between these two variables. For example, the per-unit cost may decrease relatively less quickly when older (less efficient) machines have to be put on-line in order to cope with the larger volume. Suppose that the older machines go on-line when the production output rises above 500 units per month; we may specify a regression model for cost-per-unit as:

y = b0 + b1*x*(x 500) + b2*x*(x > 500)

In this formula, y stands for the estimated per-unit cost; x is the output per month. The expressions (x 500) and (x > 500) denote logical conditions that evaluate to 0 if false, and to 1 if true. Thus, this model specifies a common intercept (b0), and a slope that is either equal to b1 (if x 500 is true, that is, equal to 1) or b2 (if x > 500 is true, that is, equal to 1).

Instead of specifying the point where the discontinuity in the regression line occurs (at 500 units per months in the example above), you could also estimate that point. For example, you might have noticed or suspected that there is a discontinuity in the cost-per-unit at one particular point; however, you may not know where that point is. In that case, simply replace the 500 in the equation above with an additional parameter (e.g., b3).

Breakpoint regressionYou could also adjust the equation above to reflect a “jump” in the regression line. For example, imagine that, after the older machines are put on-line, the per-unit-cost jumps to a higher level, and then slowly goes down as volume continues to increase. In that case, simply specify an additional intercept (b3), so that:

y = (b0 + b1*x)*(x 500) + (b3 + b2*x)*(x > 500)

Comparing groups.The method described here to estimate different regression equations in different domains of the independent variable can also be used to distinguish between groups. For example, suppose in the example above, there are three different plants; to simplify the example, let us ignore the breakpoint for now. If we coded the three plants in a grouping variable by using the values 1, 2, and 3, we could simultaneously estimate three different regression equations by specifying:

y = (xp=1)*(b10 + b11*x) + (xp=2)*(b20 + b21*x) + (xp=3)*(b30 + b31*x)

In this equation, xp denotes the grouping variable containing the codes that identify each plant, b10, b20, and b30 are the three different intercepts, and b11, b21, and b31 refer to the slope parameters (regression coefficients) for each plant. You could compare the fit of the common regression model without considering the different groups (plants) with this model in order to determine which model is more appropriate.

Nonlinear Estimation Procedures

Least Squares Estimation. Some of the more common nonlinear regression models are reviewed in Common Nonlinear Regression Models. Now, the question arises as to how these models are estimated. If you are familiar with linear regression techniques (as described in Multiple Regression) or analysis of variance (ANOVA) techniques (as described in ANOVA/MANOVA), then you may be aware of the fact that all of those methods use so-called least squares estimation procedures. In the most general terms, least squares estimation is aimed at minimizing the sum of squared deviations of the observed values for the dependent variable from those predicted by the model. (The term least squares was first used by Legendre, 1805.)

Loss Functions. In standard multiple regression we estimate the regression coefficients by “finding” those coefficients that minimize the residual variance (sum of squared residuals) around the regression line. Any deviation of an observed score from a predicted score signifies some loss in the accuracy of our prediction, for example, due to random noise (error). Therefore, we can say that the goal of least squares estimation is to minimize a loss function; specifically, this loss function is defined as the sum of the squared deviation about the predicted values (the term loss was first used by Wald, 1939). When this function is at its minimum, then we get the same parameter estimates (intercept, regression coefficients) as we would in Multiple Regression; because of the particular loss functions that yielded those estimates, we can call the estimates least squares estimates.

Phrased in this manner, there is no reason why you cannot consider other loss functions. For example, rather than minimizing the sum of squared deviations, why not minimize the sum of absolute deviations? Indeed, this is sometimes useful in order to “de-emphasize” outliers. Relative to all other residuals, a large residual will become much larger when squared. However, if you only take the absolute value of the deviations, then the resulting regression line will most likely be less affected by outliers.

There are several function minimization methods that can be used to minimize any kind of loss function. For more information, see:

Weighted Least Squares. In addition to least squares and absolute deviation regression (see above), weighted least squares estimation is probably the most commonly used technique. Ordinary least squares techniques assume that the residual variance around the regression line is the same across all values of the independent variable(s). Put another way, it is assumed that the error variance in the measurement of each case is identical. Often, this is not a realistic assumption; in particular, violations frequently occur in business, economic, or biological applications.

For example, suppose we wanted to study the relationship between the projected cost of construction projects, and the actual cost. This may be useful in order to gage the expected cost overruns. In this case it is reasonable to assume that the absolute magnitude (dollar amount) by which the estimates are off, is proportional to the size of the project. Thus, we would use a weighted least squares loss function to fit a linear regression model. Specifically, the loss function would be (see, for example, Neter, Wasserman, & Kutner, 1985, p. 168):

Loss = (Obs-Pred)2 * (1/x2)

In this equation, the loss function first specifies the standard least squares loss function (Observed minus Predicted squared; i.e., the squared residual), and then weighs this loss by the inverse of the squared value of the independent variable (x) for each case. In the actual estimation, you sum up the value of the loss function for each case (e.g., construction project), as specified above, and estimate the parameters that minimize that sum. To return to our example, the larger the project (x) the less weight is placed on the deviation from the predicted value (cost). This method will yield more stable estimates of the regression parameters (for more details, see Neter, Wasserman, & Kutner, 1985).

Maximum Likelihood. An alternative to the least squares loss function (see above) is to maximize the likelihood or log-likelihood function (or to minimize the negative log-likelihood function; the term maximum likelihood was first used by Fisher, 1922a). In most general terms, the likelihood function is defined as:

L = F(Y,Model) = in= 1 {p [yi, Model Parameters(xi)]}

In theory, we can compute the probability (now called L, the likelihood) of the specific dependent variable values to occur in our sample, given the respective regression model. Provided that all observations are independent of each other, this likelihood is the geometric sum (, across i = 1 to n cases) of probabilities for each individual observation (i) to occur, given the respective model and parameters for the x values. (The geometric sum means that we would multiply out the individual probabilities across cases.) It is also customary to express this function as a natural logarithm, in which case the geometric sum becomes a regular arithmetic sum (, across i = 1 to n cases).

Given the respective model, the larger the likelihood of the model, the larger is the probability of the dependent variable values to occur in the sample. Therefore, the greater the likelihood, the better is the fit of the model to the data. The actual computations for particular models here can become quite complicated because we need to “track” (compute) the probabilities of the y-values to occur (given the model and the respective x– values). As it turns out, if all assumptions for standard multiple regression are met (as described in the Multiple Regression chapter in the manual), then the standard least squares estimation method (see above) will yield results identical to the maximum likelihood method. If the assumption of equal error variances across the range of the x variable(s) is violated, then the weighted least squares method described earlier will yield maximum likelihood estimates.

Maximum Likelihood and Probit/Logit Models. The maximum likelihood function has been “worked out” for probit and logit regression models. Specifically, the loss function for these models is computed as the sum of the natural log of the logit or probit likelihood L1 so that:

log(L1) = in= 1 [yi*log(pi ) + (1-yi )*log(1-pi )]

log(L1) is the natural log of the (logit or probit) likelihood (log-likelihood) for the current model
yi is the observed value for case i
pi is the expected (predicted or fitted) probability (between 0 and 1)

The log-likelihood of the null model (L0), that is, the model containing the intercept only (and no regression coefficients) is computed as:

log(L0) = n0*(log(n0/n)) + n1*(log(n1/n))

log(L0) is the natural log of the (logit or probit) likelihood of the null model (intercept only)
n0 is the number of observations with a value of 0 (zero)
n1 is the number of observations with a value of 1
n is the total number of observations

Function Minimization Algorithms. Now that we have discussed different regression models, and the loss functions that can be used to estimate them, the only “mystery” that is left is how to minimize the loss functions (to find the best fitting set of parameters), and how to estimate the standard errors of the parameter estimates. There is one very efficient algorithm (quasi-Newton) that approximates the second-order derivatives of the loss function to guide the search for the minimum (i.e., for the best parameter estimates, given the respective loss function). In addition, there are several more general function minimization algorithms that follow different search strategies (which do not depend on the second-order derivatives). These strategies are sometimes more effective for estimating loss functions with local minima; therefore, these methods are often particularly useful to find appropriate start values for the estimation via the quasi-Newton method.

In all cases, you can compute the standard errors of the parameter estimates. These standard errors are based on the second-order partial derivatives for the parameters, which are computed via finite difference approximation.

If you are not interested in how the minimization of the loss function is done, only that it can be done, you may skip the following paragraphs. However, you may find it useful to know a little about these procedures in case your regression model “refuses” to be fit to the data. In that case, the iterative estimation procedure will fail to converge, producing ever “stranger” (e.g., very large or very small) parameter estimates.

In the following paragraphs we will first discuss some general issues involved in unconstrained optimization, and then briefly review the methods used. For more detailed discussions of these procedures you may refer to Brent (1973), Gill and Murray (1974), Peressini, Sullivan, and Uhl (1988), and Wilde and Beightler (1967). For specific algorithms, see Dennis and Schnabel (1983), Eason and Fenton (1974), Fletcher (1969), Fletcher and Powell (1963), Fletcher and Reeves (1964), Hooke and Jeeves (1961), Jacoby, Kowalik, and Pizzo (1972), and Nelder and Mead (1964).

Start Values, Step Sizes, Convergence Criteria. A common aspect of all estimation procedures is that they require the user to specify some start values, initial step sizes, and a criterion for convergence . All methods will begin with a particular set of initial estimates (start values), which will be changed in some systematic manner from iteration to iteration; in the first iteration, the step size determines by how much the parameters will be moved. Finally, the convergence criterion determines when the iteration process will stop. For example, the process may stop when the improvements in the loss function from iteration to iteration are less than a specific amount.

Penalty Functions, Constraining Parameters. These estimation procedures are unconstrained in nature. When this happens, it will move parameters around without any regard for whether or not permissible values result. For example, in the course of logit regression we may get estimated values that are equal to 0.0, in which case the logarithm cannot be computed (since the log of 0 is undefined). When this happens, it will assign a penalty to the loss function, that is, a very large value. As a result, the various estimation procedures usually move away from the regions that produce those functions. However, in some circumstances, the estimation will “get stuck,” and as a result, you would see a very large value of the loss function. This could happen, if, for example, the regression equation involves taking the logarithm of an independent variable which has a value of zero for some cases (in which case the logarithm cannot be computed).

If you want to constrain a procedure, then this constraint must be specified in the loss function as a penalty function (assessment). By doing this, you may control what permissible values of the parameters to be estimated may be manipulated. For example, if two parameters (a and b) are to be constrained to be greater than or equal to zero, then you must assess a large penalty to these parameters if this condition is not met. Below is an example of a user-specified regression and loss function, including a penalty assessment designed to “penalize” the parameters a and/or b if either one is not greater than or equal to zero:

Estimated function: v3 = a + b*v1 + (c*v2)
Loss function: L = (obs – pred)**2 + (a<0)*100000 + (b<0)*100000

Local Minima. The most “treacherous” threat to unconstrained function minimization is local minima. For example, a particular loss function may become slightly larger, regardless of how a particular parameter is moved. However, if the parameter were to be moved into a completely different place, the loss function may actually become smaller. You can think of such local minima as local “valleys” or minor “dents” in the loss function. However, in most practical applications, local minima will produce “outrageous” and extremely large or small parameter estimates with very large standard errors. In those cases, specify different start values and try again. Also note, that the Simplex method (see below) is particularly “smart” in avoiding such minima; therefore, this method may be particularly suited in order to find appropriate start values for complex functions.

Quasi-Newton Method. As you may remember, the slope of a function at a particular point can be computed as the first- order derivative of the function (at that point). The “slope of the slope” is the second-order derivative, which tells us how fast the slope is changing at the respective point, and in which direction. The quasi-Newton method will, at each step, evaluate the function at different points in order to estimate the first-order derivatives and second-order derivatives. It will then use this information to follow a path towards the minimum of the loss function.

Simplex Procedure. This algorithm does not rely on the computation or estimation of the derivatives of the loss function. Instead, at each iteration the function will be evaluated at m+1 points in the m dimensional parameter space. For example, in two dimensions (i.e., when there are two parameters to be estimated), it will evaluate the function at three points around the current optimum. These three points would define a triangle; in more than two dimensions, the “figure” produced by these points is called a Simplex. Intuitively, in two dimensions, three points will allow us to determine “which way to go,” that is, in which direction in the two dimensional space to proceed in order to minimize the function. The same principle can be applied to the multidimensional parameter space, that is, the Simplex will “move” downhill; when the current step sizes become too “crude” to detect a clear downhill direction, (i.e., the Simplex is too large), the Simplex will “contract” and try again.

An additional strength of this method is that when a minimum appears to have been found, the Simplex will again be expanded to a larger size to see whether the respective minimum is a local minimum. Thus, in a way, the Simplex moves like a smooth single cell organism down the loss function, contracting and expanding as local minima or significant ridges are encountered.

Hooke-Jeeves Pattern Moves. In a sense this is the simplest of all algorithms. At each iteration, this method first defines a pattern of points by moving each parameter one by one, so as to optimize the current loss function. The entire pattern of points is then shifted or moved to a new location; this new location is determined by extrapolating the line from the old base point in the m dimensional parameter space to the new base point. The step sizes in this process are constantly adjusted to “zero in” on the respective optimum. This method is usually quite effective, and should be tried if both the quasi-Newton and Simplex methods (see above) fail to produce reasonable estimates.

Rosenbrock Pattern Search. Where all other methods fail, the Rosenbrock Pattern Search method often succeeds. This method will rotate the parameter space and align one axis with a ridge (this method is also called the method of rotating coordinates); all other axes will remain orthogonal to this axis. If the loss function is unimodal and has detectable ridges pointing towards the minimum of the function, then this method will proceed with sure-footed accuracy towards the minimum of the function. However, note that this search algorithm may terminate early when there are several constraint boundaries (resulting in the penalty value; see above) that intersect, leading to a discontinuity in the ridges.

Hessian Matrix and Standard Errors. The matrix of second-order (partial) derivatives is also called the Hessian matrix. It turns out that the inverse of the Hessian matrix approximates the variance/covariance matrix of parameter estimates. Intuitively, there should be an inverse relationship between the second-order derivative for a parameter and its standard error: If the change of the slope around the minimum of the function is very sharp, then the second-order derivative will be large; however, the parameter estimate will be quite stable in the sense that the minimum with respect to the parameter is clearly identifiable. If the second-order derivative is nearly zero, then the change in the slope around the minimum is zero, meaning that we can practically move the parameter in any direction without greatly affecting the loss function. Thus, the standard error of the parameter will be very large.

The Hessian matrix (and asymptotic standard errors for the parameters) can be computed via finite difference approximation. This procedure yields very precise asymptotic standard errors for all estimation methods.

Evaluating the Fit of the Model

After estimating the regression parameters, an essential aspect of the analysis is to test the appropriateness of the overall model. For example, if you specified a linear regression model, but the relationship is intrinsically non-linear, then the parameter estimates (regression coefficients) and the estimated standard errors of those estimates may be significantly “off.” Let us review some of the ways to evaluate the appropriateness of a model.

Proportion of Variance Explained. Regardless of the model, you can always compute the total variance of the dependent variable (total sum of squares, SST), the proportion of variance due to the residuals (error sum of squares, SSE), and the proportion of variance due to the regression model (regression sum of squares, SSR=SST-SSE). The ratio of the regression sum of squares to the total sum of squares (SSR/SST) explains the proportion of variance accounted for in the dependent variable (y) by the model; thus, this ratio is equivalent to the R-square (0 R-square 1, the coefficient of determination). Even when the dependent variable is not normally distributed across cases, this measure may help evaluate how well the model fits the data.

Goodness-of-fit Chi-square. For probit and logit regression models, you may use maximum likelihood estimation (i.e., maximize the likelihood function). As it turns out, you can directly compare the likelihood L0 for the null model where all slope parameters are zero, with the likelihood L1 of the fitted model. Specifically, you can compute the Chi-square statistic for this comparison as:

Chi-square = -2 * (log(L0) – log(L1))

The degrees of freedom for this Chi-square value are equal to the difference in the number of parameters for the null and the fitted model; thus, the degrees of freedom will be equal to the number of independent variables in the logit or probit regression. If the p-value associated with this Chi-square is significant, then we can say that the estimated model yields a significantly better fit to the data than the null model, that is, that the regression parameters are statistically significant.

Plot of Observed vs. Predicted Values. It is always a good idea to inspect a scatterplot of predicted vs. observed values. If the model is appropriate for the data, then we would expect the points to roughly follow a straight line; if the model is incorrectly specified, then this plot will indicate a non-linear pattern.

Normal and Half-Normal Probability Plots. The normal probability plot of residual will give us an indication of whether or not the residuals (i.e., errors) are normally distributed.

Plot of the Fitted Function. For models involving two or three variables (one or two predictors) it is useful to plot the fitted function using the final parameter estimates. Here is an example of a 3D plot with two predictor variables:

This type of plot represents the most direct visual check of whether or not a model fits the data, and whether there are apparent outliers.

Variance/Covariance Matrix for Parameters. When a model is grossly misspecified, or the estimation procedure gets “hung up” in a local minimum, the standard errors for the parameter estimates can become very large. This means that regardless of how the parameters were moved around the final values, the resulting loss function did not change much. Also, the correlations between parameters may become very large, indicating that parameters are very redundant; put another way, when the estimation algorithm moved one parameter away from the final value, then the increase in the loss function could be almost entirely compensated for by moving another parameter. Thus, the effect of those two parameters on the loss function was very redundant.

Statistics: Model Extremely Complex Functions, Neural Networks

Many concepts related to the neural networks methodology are best explained if they are illustrated with applications of a specific neural network program. Therefore, this section contains many references to STATISTICA Neural Networks, a particularly comprehensive neural networks application available from StatSoft.


Neural networks have seen an explosion of interest over the last few years, and are being successfully applied across an extraordinary range of problem domains, in areas as diverse as finance, medicine, engineering, geology and physics. Indeed, anywhere that there are problems of prediction, classification or control, neural networks are being introduced. This sweeping success can be attributed to a few key factors:

  • Power. Neural networks are very sophisticated modeling techniques capable of modeling extremely complex functions. In particular, neural networks are nonlinear (a term which is discussed in more detail later in this section). For many years linear modeling has been the commonly used technique in most modeling domains since linear models have well-known optimization strategies. Where the linear approximation was not valid (which was frequently the case) the models suffered accordingly. Neural networks also keep in check the curse of dimensionality problem that bedevils attempts to model nonlinear functions with large numbers of variables.
  • Ease of use. Neural networks learn by example. The neural network user gathers representative data, and then invokes training algorithms to automatically learn the structure of the data. Although the user does need to have some heuristic knowledge of how to select and prepare data, how to select an appropriate neural network, and how to interpret the results, the level of user knowledge needed to successfully apply neural networks is much lower than would be the case using (for example) some more traditional nonlinear statistical methods.

Neural networks are also intuitively appealing, based as they are on a crude low-level model of biological neural systems. In the future, the development of this neurobiological modeling may lead to genuinely intelligent computers.

 Applications for Neural Networks

Neural networks are applicable in virtually every situation in which a relationship between the predictor variables (independents, inputs) and predicted variables (dependents, outputs) exists, even when that relationship is very complex and not easy to articulate in the usual terms of “correlations” or “differences between groups.” A few representative examples of problems to which neural network analysis has been applied successfully are:

  • Detection of medical phenomena. A variety of health-related indices (e.g., a combination of heart rate, levels of various substances in the blood, respiration rate) can be monitored. The onset of a particular medical condition could be associated with a very complex (e.g., nonlinear and interactive) combination of changes on a subset of the variables being monitored. Neural networks have been used to recognize this predictive pattern so that the appropriate treatment can be prescribed.
  • Stock market prediction. Fluctuations of stock prices and stock indices are another example of a complex, multidimensional, but in some circumstances at least partially-deterministic phenomenon. Neural networks are being used by many technical analysts to make predictions about stock prices based upon a large number of factors such as past performance of other stocks and various economic indicators.
  • Credit assignment. A variety of pieces of information are usually known about an applicant for a loan. For instance, the applicant’s age, education, occupation, and many other facts may be available. After training a neural network on historical data, neural network analysis can identify the most relevant characteristics and use those to classify applicants as good or bad credit risks.
  • Monitoring the condition of machinery. Neural networks can be instrumental in cutting costs by bringing additional expertise to scheduling the preventive maintenance of machines. A neural network can be trained to distinguish between the sounds a machine makes when it is running normally (“false alarms”) versus when it is on the verge of a problem. After this training period, the expertise of the network can be used to warn a technician of an upcoming breakdown, before it occurs and causes costly unforeseen “downtime.”
  • Engine management. Neural networks have been used to analyze the input of sensors from an engine. The neural network controls the various parameters within which the engine functions, in order to achieve a particular goal, such as minimizing fuel consumption.

The Biological Inspiration

Neural networks grew out of research in Artificial Intelligence; specifically, attempts to mimic the fault-tolerance and capacity to learn of biological neural systems by modeling the low-level structure of the brain (see Patterson, 1996). The main branch of Artificial Intelligence research in the 1960s -1980s produced Expert Systems. These are based upon a high-level model of reasoning processes (specifically, the concept that our reasoning processes are built upon manipulation of symbols). It became rapidly apparent that these systems, although very useful in some domains, failed to capture certain key aspects of human intelligence. According to one line of speculation, this was due to their failure to mimic the underlying structure of the brain. In order to reproduce intelligence, it would be necessary to build systems with a similar architecture.

The brain is principally composed of a very large number (circa 10,000,000,000) of neurons, massively interconnected (with an average of several thousand interconnects per neuron, although this varies enormously). Each neuron is a specialized cell which can propagate an electrochemical signal. The neuron has a branching input structure (the dendrites), a cell body, and a branching output structure (the axon). The axons of one cell connect to the dendrites of another via a synapse. When a neuron is activated, it fires an electrochemical signal along the axon. This signal crosses the synapses to other neurons, which may in turn fire. A neuron fires only if the total signal received at the cell body from the dendrites exceeds a certain level (the firing threshold).

The strength of the signal received by a neuron (and therefore its chances of firing) critically depends on the efficacy of the synapses. Each synapse actually contains a gap, with neurotransmitter chemicals poised to transmit a signal across the gap. One of the most influential researchers into neurological systems (Donald Hebb) postulated that learning consisted principally in altering the “strength” of synaptic connections. For example, in the classic Pavlovian conditioning experiment, where a bell is rung just before dinner is delivered to a dog, the dog rapidly learns to associate the ringing of a bell with the eating of food. The synaptic connections between the appropriate part of the auditory cortex and the salivation glands are strengthened, so that when the auditory cortex is stimulated by the sound of the bell the dog starts to salivate. Recent research in cognitive science, in particular in the area of nonconscious information processing, have further demonstrated the enormous capacity of the human mind to infer (“learn”) simple input-output covariations from extremely complex stimuli (e.g., see Lewicki, Hill, and Czyzewska, 1992).

Thus, from a very large number of extremely simple processing units (each performing a weighted sum of its inputs, and then firing a binary signal if the total input exceeds a certain level) the brain manages to perform extremely complex tasks. Of course, there is a great deal of complexity in the brain which has not been discussed here, but it is interesting that artificial neural networks can achieve some remarkable results using a model not much more complex than this.


The Basic Artificial Model

To capture the essence of biological neural systems, an artificial neuron is defined as follows:

  • It receives a number of inputs (either from original data, or from the output of other neurons in the neural network). Each input comes via a connection that has a strength (or weight); these weights correspond to synaptic efficacy in a biological neuron. Each neuron also has a single threshold value. The weighted sum of the inputs is formed, and the threshold subtracted, to compose the activation of the neuron (also known as the post-synaptic potential, or PSP, of the neuron).
  • The activation signal is passed through an activation function (also known as a transfer function) to produce the output of the neuron.

If the step activation function is used (i.e., the neuron’s output is 0 if the input is less than zero, and 1 if the input is greater than or equal to 0) then the neuron acts just like the biological neuron described earlier (subtracting the threshold from the weighted sum and comparing with zero is equivalent to comparing the weighted sum to the threshold). Actually, the step function is rarely used in artificial neural networks, as will be discussed. Note also that weights can be negative, which implies that the synapse has an inhibitory rather than excitatory effect on the neuron: inhibitory neurons are found in the brain.

This describes an individual neuron. The next question is: how should neurons be connected together? If a network is to be of any use, there must be inputs (which carry the values of variables of interest in the outside world) and outputs (which form predictions, or control signals). Inputs and outputs correspond to sensory and motor nerves such as those coming from the eyes and leading to the hands. However, there also can be hidden neurons that play an internal role in the network. The input, hidden and output neurons need to be connected together.

The key issue here is feedback (Haykin, 1994). A simple network has a feedforward structure: signals flow from inputs, forwards through any hidden units, eventually reaching the output units. Such a structure has stable behavior. However, if the network is recurrent (contains connections back from later to earlier neurons) it can be unstable, and has very complex dynamics. Recurrent networks are very interesting to researchers in neural networks, but so far it is the feedforward structures that have proved most useful in solving real problems.

A typical feedforward network has neurons arranged in a distinct layered topology. The input layer is not really neural at all: these units simply serve to introduce the values of the input variables. The hidden and output layer neurons are each connected to all of the units in the preceding layer. Again, it is possible to define networks that are partially-connected to only some units in the preceding layer; however, for most applications fully-connected networks are better.

When the network is executed (used), the input variable values are placed in the input units, and then the hidden and output layer units are progressively executed. Each of them calculates its activation value by taking the weighted sum of the outputs of the units in the preceding layer, and subtracting the threshold. The activation value is passed through the activation function to produce the output of the neuron. When the entire network has been executed, the outputs of the output layer act as the output of the entire network.


Using a Neural Network

The previous section describes in simplified terms how a neural network turns inputs into outputs. The next important question is: how do you apply a neural network to solve a problem?

The type of problem amenable to solution by a neural network is defined by the way they work and the way they are trained. Neural networks work by feeding in some input variables, and producing some output variables. They can therefore be used where you have some known information, and would like to infer some unknown information (see Patterson, 1996; Fausett, 1994). Some examples are:

Stock market prediction. You know last week’s stock prices and today’s DOW, NASDAQ, or FTSE index; you want to know tomorrow’s stock prices.

Credit assignment. You want to know whether an applicant for a loan is a good or bad credit risk. You usually know applicants’ income, previous credit history, etc. (because you ask them these things).

Control. You want to know whether a robot should turn left, turn right, or move forwards in order to reach a target; you know the scene that the robot’s camera is currently observing.

Needless to say, not every problem can be solved by a neural network. You may wish to know next week’s lottery result, and know your shoe size, but there is no relationship between the two. Indeed, if the lottery is being run correctly, there is no fact you could possibly know that would allow you to infer next week’s result. Many financial institutions use, or have experimented with, neural networks for stock market prediction, so it is likely that any trends predictable by neural techniques are already discounted by the market, and (unfortunately), unless you have a sophisticated understanding of that problem domain, you are unlikely to have any success there either.

Therefore, another important requirement for the use of a neural network therefore is that you know (or at least strongly suspect) that there is a relationship between the proposed known inputs and unknown outputs. This relationship may be noisy (you certainly would not expect that the factors given in the stock market prediction example above could give an exact prediction, as prices are clearly influenced by other factors not represented in the input set, and there may be an element of pure randomness) but it must exist.

In general, if you use a neural network, you won’t know the exact nature of the relationship between inputs and outputs – if you knew the relationship, you would model it directly. The other key feature of neural networks is that they learn the input/output relationship through training. There are two types of training used in neural networks, with different types of networks using different types of training. These are supervised and unsupervised training, of which supervised is the most common and will be discussed in this section (unsupervised learning is described in a later section).

In supervised learning, the network user assembles a set of training data. The training data contains examples of inputs together with the corresponding outputs, and the network learns to infer the relationship between the two. Training data is usually taken from historical records. In the above examples, this might include previous stock prices and DOW, NASDAQ, or FTSE indices, records of previous successful loan applicants, including questionnaires and a record of whether they defaulted or not, or sample robot positions and the correct reaction.

The neural network is then trained using one of the supervised learning algorithms (of which the best known example is back propagation, devised by Rumelhart et. al., 1986), which uses the data to adjust the network’s weights and thresholds so as to minimize the error in its predictions on the training set. If the network is properly trained, it has then learned to model the (unknown) function that relates the input variables to the output variables, and can subsequently be used to make predictions where the output is not known.


Gathering Data for Neural Networks

Once you have decided on a problem to solve using neural networks, you will need to gather data for training purposes. The training data set includes a number of cases, each containing values for a range of input and output variables. The first decisions you will need to make are: which variables to use, and how many (and which) cases to gather.

The choice of variables (at least initially) is guided by intuition. Your own expertise in the problem domain will give you some idea of which input variables are likely to be influential. As a first pass, you should include any variables that you think could have an influence – part of the design process will be to whittle this set down.

Neural networks process numeric data in a fairly limited range. This presents a problem if data is in an unusual range, if there is missing data, or if data is non-numeric. Fortunately, there are methods to deal with each of these problems. Numeric data is scaled into an appropriate range for the network, and missing values can be substituted for using the mean value (or other statistic) of that variable across the other available training cases (see Bishop, 1995).

Handling non-numeric data is more difficult. The most common form of non-numeric data consists of nominal-value variables such as Gender={Male, Female}. Nominal-valued variables can be represented numerically. However, neural networks do not tend to perform well with nominal variables that have a large number of possible values.

For example, consider a neural network being trained to estimate the value of houses. The price of houses depends critically on the area of a city in which they are located. A particular city might be subdivided into dozens of named locations, and so it might seem natural to use a nominal-valued variable representing these locations. Unfortunately, it would be very difficult to train a neural network under these circumstances, and a more credible approach would be to assign ratings (based on expert knowledge) to each area; for example, you might assign ratings for the quality of local schools, convenient access to leisure facilities, etc.

Other kinds of non-numeric data must either be converted to numeric form, or discarded. Dates and times, if important, can be converted to an offset value from a starting date/time. Currency values can easily be converted. Unconstrained text fields (such as names) cannot be handled and should be discarded.

The number of cases required for neural network training frequently presents difficulties. There are some heuristic guidelines, which relate the number of cases needed to the size of the network (the simplest of these says that there should be ten times as many cases as connections in the network). Actually, the number needed is also related to the (unknown) complexity of the underlying function which the network is trying to model, and to the variance of the additive noise. As the number of variables increases, the number of cases required increases nonlinearly, so that with even a fairly small number of variables (perhaps fifty or less) a huge number of cases are required. This problem is known as “the curse of dimensionality,” and is discussed further later in this section.

For most practical problem domains, the number of cases required will be hundreds or thousands. For very complex problems more may be required, but it would be a rare (even trivial) problem which required less than a hundred cases. If your data is sparser than this, you really don’t have enough information to train a network, and the best you can do is probably to fit a linear model. If you have a larger, but still restricted, data set, you can compensate to some extent by forming an ensemble of networks, each trained using a different resampling of the available data, and then average across the predictions of the networks in the ensemble.

Many practical problems suffer from data that is unreliable: some variables may be corrupted by noise, or values may be missing altogether. Neural networks are also noise tolerant. However, there is a limit to this tolerance; if there are occasional outliers far outside the range of normal values for a variable, they may bias the training. The best approach to such outliers is to identify and remove them (either discarding the case, or converting the outlier into a missing value).


Choose variables that you believe may be influential.

Numeric and nominal variables can be handled. Convert other variables to one of these forms, or discard.

Hundreds or thousands of cases are required; the more variables, the more cases.

Cases with missing values can be used, if necessary, but outliers may cause problems – check your data. Remove outliers if possible. If you have sufficient data, discard cases with missing values.

If the volume of the data available is small, consider using ensembles and resampling.


Pre- and Post-Processing

All neural networks take numeric input and produce numeric output. The transfer function of a unit is typically chosen so that it can accept input in any range, and produces output in a strictly limited range (it has a squashing effect). Although the input can be in any range, there is a saturation effect so that the unit is only sensitive to inputs within a fairly limited range. The illustration below shows one of the most common transfer functions, the logistic function (also sometimes referred to as the sigmoid function, although strictly speaking it is only one example of a sigmoid – S-shaped – function). In this case, the output is in the range (0,1), and the input is sensitive in a range not much larger than (-1,+1). The function is also smooth and easily differentiable, facts that are critical in allowing the network training algorithms to operate (this is the reason why the step function is not used in practice).

[Neural Network Example]

The limited numeric response range, together with the fact that information has to be in numeric form, implies that neural solutions require preprocessing and post-processing stages to be used in real applications (see Bishop, 1995). Two issues need to be addressed:

Scaling. Numeric values have to be scaled into a range that is appropriate for the network. Typically, raw variable values are scaled linearly. In some circumstances, non-linear scaling may be appropriate (for example, if you know that a variable is exponentially distributed, you might take the logarithm). Non-linear scaling is not supported in ST Neural Networks. Instead, you should scale the variable using STATISTICA‘s data transformation facilities before transferring the data to ST Neural Networks.

Nominal variables. Nominal variables may be two-state (e.g., Gender={Male,Female}) or many-state (i.e., more than two states). A two-state nominal variable is easily represented by transformation into a numeric value (e.g., Male=0, Female=1). Many-state nominal variables are more difficult to handle. They can be represented using an ordinal encoding (e.g., Dog=0,Budgie=1,Cat=2) but this implies a (probably) false ordering on the nominal values – in this case, that Budgies are in some sense midway between Dogs and Cats. A better approach, known as one-of-N encoding, is to use a number of numeric variables to represent the single nominal variable. The number of numeric variables equals the number of possible values; one of the N variables is set, and the others cleared (e.g., Dog={1,0,0}, Budgie={0,1,0}, Cat={0,0,1}). ST Neural Networks has facilities to convert both two-state and many-state nominal variables for use in the neural network. Unfortunately, a nominal variable with a large number of states would require a prohibitive number of numeric variables for one-of-N encoding, driving up the network size and making training difficult. In such a case it is possible (although unsatisfactory) to model the nominal variable using a single numeric ordinal; a better approach is to look for a different way to represent the information.

Prediction problems may be divided into two main categories:

Classification. In classification, the objective is to determine to which of a number of discrete classes a given input case belongs. Examples include credit assignment (is this person a good or bad credit risk), cancer detection (tumor, clear), signature recognition (forgery, true). In all these cases, the output required is clearly a single nominal variable. The most common classification tasks are (as above) two-state, although many-state tasks are also not unknown.

Regression. In regression, the objective is to predict the value of a (usually) continuous variable: tomorrow’s stock price, the fuel consumption of a car, next year’s profits. In this case, the output required is a single numeric variable.

Neural networks can actually perform a number of regression and/or classification tasks at once, although commonly each network performs only one. In the vast majority of cases, therefore, the network will have a single output variable, although in the case of many-state classification problems, this may correspond to a number of output units (the post-processing stage takes care of the mapping from output units to output variables). If you do define a single network with multiple output variables, it may suffer from cross-talk (the hidden neurons experience difficulty learning, as they are attempting to model at least two functions at once). The best solution is usually to train separate networks for each output, then to combine them into an ensemble so that they can be run as a unit.


Multilayer Perceptrons

This is perhaps the most popular network architecture in use today, due originally to Rumelhart and McClelland (1986) and discussed at length in most neural network textbooks (e.g., Bishop, 1995). This is the type of network discussed briefly in previous sections: the units each perform a biased weighted sum of their inputs and pass this activation level through a transfer function to produce their output, and the units are arranged in a layered feedforward topology. The network thus has a simple interpretation as a form of input-output model, with the weights and thresholds (biases) the free parameters of the model. Such networks can model functions of almost arbitrary complexity, with the number of layers, and the number of units in each layer, determining the function complexity. Important issues in Multilayer Perceptrons (MLP) design include specification of the number of hidden layers and the number of units in these layers (see Haykin, 1994; Bishop, 1995).

The number of input and output units is defined by the problem (there may be some uncertainty about precisely which inputs to use, a point to which we will return later. However, for the moment we will assume that the input variables are intuitively selected and are all meaningful). The number of hidden units to use is far from clear. As good a starting point as any is to use one hidden layer, with the number of units equal to half the sum of the number of input and output units. Again, we will discuss how to choose a sensible number later.

Training Multilayer Perceptrons

Once the number of layers, and number of units in each layer, has been selected, the network’s weights and thresholds must be set so as to minimize the prediction error made by the network. This is the role of the training algorithms. The historical cases that you have gathered are used to automatically adjust the weights and thresholds in order to minimize this error. This process is equivalent to fitting the model represented by the network to the training data available. The error of a particular configuration of the network can be determined by running all the training cases through the network, comparing the actual output generated with the desired or target outputs. The differences are combined together by an error function to give the network error. The most common error functions are the sum squared error (used for regression problems), where the individual errors of output units on each case are squared and summed together, and the cross entropy functions (used for maximum likelihood classification).

In traditional modeling approaches (e.g., linear modeling) it is possible to algorithmically determine the model configuration that absolutely minimizes this error. The price paid for the greater (non-linear) modeling power of neural networks is that although we can adjust a network to lower its error, we can never be sure that the error could not be lower still.

A helpful concept here is the error surface. Each of the N weights and thresholds of the network (i.e., the free parameters of the model) is taken to be a dimension in space. The N+1th dimension is the network error. For any possible configuration of weights the error can be plotted in the N+1th dimension, forming an error surface. The objective of network training is to find the lowest point in this many-dimensional surface.

In a linear model with sum squared error function, this error surface is a parabola (a quadratic), which means that it is a smooth bowl-shape with a single minimum. It is therefore “easy” to locate the minimum.

Neural network error surfaces are much more complex, and are characterized by a number of unhelpful features, such as local minima (which are lower than the surrounding terrain, but above the global minimum), flat-spots and plateaus, saddle-points, and long narrow ravines.

It is not possible to analytically determine where the global minimum of the error surface is, and so neural network training is essentially an exploration of the error surface. From an initially random configuration of weights and thresholds (i.e., a random point on the error surface), the training algorithms incrementally seek for the global minimum. Typically, the gradient (slope) of the error surface is calculated at the current point, and used to make a downhill move. Eventually, the algorithm stops in a low point, which may be a local minimum (but hopefully is the global minimum).

The Back Propagation Algorithm

The best-known example of a neural network training algorithm is back propagation (see Patterson, 1996; Haykin, 1994; Fausett, 1994). In back propagation, the gradient vector of the error surface is calculated. This vector points along the line of steepest descent from the current point, so we know that if we move along it a “short” distance, we will decrease the error. A sequence of such moves (slowing as we near the bottom) will eventually find a minimum of some sort. The difficult part is to decide how large the steps should be.

Large steps may converge more quickly, but may also overstep the solution or (if the error surface is very eccentric) go off in the wrong direction. A classic example of this in neural network training is where the algorithm progresses very slowly along a steep, narrow, valley, bouncing from one side across to the other. In contrast, very small steps may go in the correct direction, but they also require a large number of iterations. In practice, the step size is proportional to the slope (so that the algorithms settles down in a minimum) and to a special constant: the learning rate. The correct setting for the learning rate is application-dependent, and is typically chosen by experiment; it may also be time-varying, getting smaller as the algorithm progresses.

The algorithm is also usually modified by inclusion of a momentum term: this encourages movement in a fixed direction, so that if several steps are taken in the same direction, the algorithm “picks up speed”, which gives it the ability to (sometimes) escape local minimum, and also to move rapidly over flat spots and plateaus.

The algorithm therefore progresses iteratively, through a number of epochs. On each epoch, the training cases are each submitted in turn to the network, and target and actual outputs compared and the error calculated. This error, together with the error surface gradient, is used to adjust the weights, and then the process repeats. The initial network configuration is random, and training stops when a given number of epochs elapses or when the error stops improving (you can select which of these stopping conditions to use).

Over-Learning and Generalization

One major problem with the approach outlined above is that it doesn’t actually minimize the error that we are really interested in – which is the expected error the network will make when new cases are submitted to it. In other words, the most desirable property of a network is its ability to generalize to new cases. In reality, the network is trained to minimize the error on the training set, and short of having a perfect and infinitely large training set, this is not the same thing as minimizing the error on the real error surface – the error surface of the underlying and unknown model (see Bishop, 1995).

The most important manifestation of this distinction is the problem of over-learning, or over-fitting. It is easiest to demonstrate this concept using polynomial curve fitting rather than neural networks, but the concept is precisely the same.

A polynomial is an equation with terms containing only constants and powers of the variables. For example:


Different polynomials have different shapes, with larger powers (and therefore larger numbers of terms) having steadily more eccentric shapes. Given a set of data, we may want to fit a polynomial curve (i.e., a model) to explain the data. The data is probably noisy, so we don’t necessarily expect the best model to pass exactly through all the points. A low-order polynomial may not be sufficiently flexible to fit close to the points, whereas a high-order polynomial is actually too flexible, fitting the data exactly by adopting a highly eccentric shape that is actually unrelated to the underlying function. See illustration below.

[Neural Network Example]

Neural networks have precisely the same problem. A network with more weights models a more complex function, and is therefore prone to over-fitting. A network with less weights may not be sufficiently powerful to model the underlying function. For example, a network with no hidden layers actually models a simple linear function.

How then can we select the right complexity of network? A larger network will almost invariably achieve a lower error eventually, but this may indicate over-fitting rather than good modeling.

The answer is to check progress against an independent data set, the test set. Some of the cases are reserved, and not actually used for training in the back propagation algorithm. Instead, they are used to keep an independent check on the progress of the algorithm. It is invariably the case that the initial performance of the network on training and test sets is the same (if it is not at least approximately the same, the division of cases between the two sets is probably biased). As training progresses, the training error naturally drops, and providing training is minimizing the true error function, the selection error drops too. However, if the selection error stops dropping, or indeed starts to rise, this indicates that the network is starting to overfit the data, and training should cease. When over-fitting occurs during the training process like this, it is called over-learning. In this case, it is usually advisable to decrease the number of hidden units and/or hidden layers, as the network is over-powerful for the problem at hand. In contrast, if the network is not sufficiently powerful to model the underlying function, over-learning is not likely to occur, and neither training nor selection errors will drop to a satisfactory level.

The problems associated with local minima, and decisions over the size of network to use, imply that using a neural network typically involves experimenting with a large number of different networks, probably training each one a number of times (to avoid being fooled by local minima), and observing individual performances. The key guide to performance here is the selection error. However, following the standard scientific precept that, all else being equal, a simple model is always preferable to a complex model, you can also select a smaller network in preference to a larger one with a negligible improvement in selection error.

A problem with this approach of repeated experimentation is that the test set plays a key role in selecting the model, which means that it is actually part of the training process. Its reliability as an independent guide to performance of the model is therefore compromised – with sufficient experiments, you may just hit upon a lucky network that happens to perform well on the test set. To add confidence in the performance of the final model, it is therefore normal practice (at least where the volume of training data allows it) to reserve a third set of cases – the validation set. The final model is tested with the validation set data, to ensure that the results on the testing and training set are real, and not artifacts of the training process. Of course, to fulfill this role properly, the validation set should be used only once – if it is in turn used to adjust and reiterate the training process, it effectively becomes selection data.

This division into multiple subsets is very unfortunate, given that we usually have less data than we would ideally desire even for a single subset. We can get around this problem by resampling. Experiments can be conducted using different divisions of the available data into training, testing, and validation sets. There are a number of approaches to this subset, including random (monte-carlo) resampling and bootstrap. If we make design decisions, such as the best configuration of neural network to use, based upon a number of experiments with different subset examples, the results will be much more reliable. We can then either use those experiments solely to guide the decision as to which network types to use, and train such networks from scratch with new samples (this removes any sampling bias); or, we can retain the best networks found during the sampling process, but average their results in an ensemble, which at least mitigates the sampling bias.

To summarize, network design (once the input variables have been selected) follows a number of stages:

  • Select an initial configuration (typically, one hidden layer with the number of hidden units set to half the sum of the number of input and output units).
  • Iteratively conduct a number of experiments with each configuration, retaining the best network (in terms of selection error) found. A number of experiments are required with each configuration to avoid being fooled if training locates a local minimum, and it is also best to resample.
  • On each experiment, if under-learning occurs (the network doesn’t achieve an acceptable performance level) try adding more neurons to the hidden layer(s). If this doesn’t help, try adding an extra hidden layer.
  • If over-learning occurs (selection error starts to rise) try removing hidden units.
  • Once you have experimentally determined an effective configuration for your networks, resample and generate new networks with that configuration.

Data Selection

All the above stages rely on a key assumption. Specifically, the training, verification and test data must be representative of the underlying model (and, further, the three sets must be independently representative). The old computer science adage “garbage in, garbage out” could not apply more strongly than in neural modeling. If training data is not representative, then the model’s worth is at best compromised. At worst, it may be useless. It is worth spelling out the kind of problems which can corrupt a training set:

The future is not the past. Training data is typically historical. If circumstances have changed, relationships which held in the past may no longer hold.

All eventualities must be covered. A neural network can only learn from cases that are present. If people with incomes over $100,000 per year are a bad credit risk, and your training data includes nobody over $40,000 per year, you cannot expect it to make a correct decision when it encounters one of the previously-unseen cases. Extrapolation is dangerous with any model, but some types of neural network may make particularly poor predictions in such circumstances.

A network learns the easiest features it can. A classic (possibly apocryphal) illustration of this is a vision project designed to automatically recognize tanks. A network is trained on a hundred pictures including tanks, and a hundred not. It achieves a perfect 100% score. When tested on new data, it proves hopeless. The reason? The pictures of tanks are taken on dark, rainy days; the pictures without on sunny days. The network learns to distinguish the (trivial matter of) differences in overall light intensity. To work, the network would need training cases including all weather and lighting conditions under which it is expected to operate – not to mention all types of terrain, angles of shot, distances…

Unbalanced data sets. Since a network minimizes an overall error, the proportion of types of data in the set is critical. A network trained on a data set with 900 good cases and 100 bad will bias its decision toward good cases, as this allows the algorithm to lower the overall error (which is much more heavily influenced by the good cases). If the representation of good and bad cases is different in the real population, the network’s decisions may be wrong. A good example would be disease diagnosis. Perhaps 90% of patients routinely tested are clear of a disease. A network is trained on an available data set with a 90/10 split. It is then used in diagnosis on patients complaining of specific problems, where the likelihood of disease is 50/50. The network will react over-cautiously and fail to recognize disease in some unhealthy patients. In contrast, if trained on the “complainants” data, and then tested on “routine” data, the network may raise a high number of false positives. In such circumstances, the data set may need to be crafted to take account of the distribution of data (e.g., you could replicate the less numerous cases, or remove some of the numerous cases) (Bishop, 1995). Often, the best approach is to ensure even representation of different cases, then to interpret the network’s decisions accordingly.

Insights into MLP Training

More key insights into MLP behavior and training can be gained by considering the type of functions they model. Recall that the activation level of a unit is the weighted sum of the inputs, plus a threshold value. This implies that the activation level is actually a simple linear function of the inputs. The activation is then passed through a sigmoid (S-shaped) curve. The combination of the multi-dimensional linear function and the one-dimensional sigmoid function gives the characteristic sigmoid cliff response of a first hidden layer MLP unit (the figure below illustrates the shape plotted across two inputs. An MLP unit with more inputs has a higher-dimensional version of this functional shape). Altering the weights and thresholds alters this response surface. In particular, both the orientation of the surface, and the steepness of the sloped section, can be altered. A steep slope corresponds to large weight values: doubling all weight values gives the same orientation but a different slope.

[Neural Network Example]

The next figure illustrates a typical response surface for a network with only one hidden layer, of two units, and a single output unit, on the classic XOR problem. Two separate sigmoid surfaces have been combined into a single U-shaped surface.

During network training, the weights and thresholds are first initialized to small, random values. This implies that the units’ response surfaces are each aligned randomly with low slope: they are effectively uncommitted. As training progresses, the units’ response surfaces are rotated and shifted into appropriate positions, and the magnitudes of the weights grow as they commit to modeling particular parts of the target response surface.

In a classification problem, an output unit’s task is to output a strong signal if a case belongs to its class, and a weak signal if it doesn’t. In other words, it is attempting to model a function that has magnitude one for parts of the pattern-space that contain its cases, and magnitude zero for other parts.

[Neural Network Example]

This is known as a discriminant function in pattern recognition problems. An ideal discriminant function could be said to have a plateau structure, where all points on the function are either at height zero or height one.

If there are no hidden units, then the output can only model a single sigmoid-cliff with areas to one side at low height and areas to the other high. There will always be a region in the middle (on the cliff) where the height is in-between, but as weight magnitudes are increased, this area shrinks.

A sigmoid-cliff like this is effectively a linear discriminant. Points to one side of the cliff are classified as belonging to the class, points to the other as not belonging to it. This implies that a network with no hidden layers can only classify linearly-separable problems (those where a line – or, more generally in higher dimensions, a hyperplane – can be drawn which separates the points in pattern space).

A network with a single hidden layer has a number of sigmoid-cliffs (one per hidden unit) represented in that hidden layer, and these are in turn combined into a plateau in the output layer. The plateau has a convex hull (i.e., there are no dents in it, and no holes inside it). Although the plateau is convex, it may extend to infinity in some directions (like an extended peninsular). Such a network is in practice capable of modeling adequately most real-world classification problems.

[Neural Network Example]

The figure above shows the plateau response surface developed by an MLP to solve the XOR problem: as can be seen, this neatly sections the space along a diagonal.

A key question in classification is how to interpret points on or near the cliff. The standard practice is to adopt some confidence levels (the accept and reject thresholds) that must be exceeded before the unit is deemed to have made a decision. For example, if accept/reject thresholds of 0.95/0.05 are used, an output unit with an output level in excess of 0.95 is deemed to be on, below 0.05 it is deemed to be off, and in between it is deemed to be undecided.

A more subtle (and perhaps more useful) interpretation is to treat the network outputs as probabilities. In this case, the network gives more information than simply a decision: it tells us how sure (in a formal sense) it is of that decision. There are modifications to MLPs that allow the neural network outputs to be interpreted as probabilities, which means that the network effectively learns to model the probability density function of the class. However, the probabilistic interpretation is only valid under certain assumptions about the distribution of the data (specifically, that it is drawn from the family of exponential distributions; see Bishop, 1995). Ultimately, a classification decision must still be made, but a probabilistic interpretation allows a more formal concept of minimum cost decision making to be evolved.

Other MLP Training Algorithms

Neural networks are highly nonlinear tools that are usually trained using iterative techniques. The most recommended techniques for training neural networks are the BFGS (Broyden-Fletcher-Goldfarb-Shanno) and Scaled Conjugate Gradient algorithms (see Bishop 1995). These methods perform significantly better than the more traditional algorithms such as Gradient Descent but they are, generally speaking, more memory intensive and computationally demanding. Nonetheless, these techniques may require a smaller number of iterations to train a neural network given their fast convergence rate and more intelligent search criterion.

Radial Basis Function Networks

We have seen in the last section how an MLP models the response function using the composition of sigmoid-cliff functions – for a classification problem, this corresponds to dividing the pattern space up using hyperplanes. The use of hyperplanes to divide up space is a natural approach – intuitively appealing, and based on the fundamental simplicity of lines.

An equally appealing and intuitive approach is to divide up space using circles or (more generally) hyperspheres. A hypersphere is characterized by its center and radius. More generally, just as an MLP unit responds (non-linearly) to the distance of points from the line of the sigmoid-cliff, in a radial basis function network (Broomhead and Lowe, 1988; Moody and Darkin, 1989; Haykin, 1994) units respond (non-linearly) to the distance of points from the center represented by the radial unit. The response surface of a single radial unit is therefore a Gaussian (bell-shaped) function, peaked at the center, and descending outwards. Just as the steepness of the MLP’s sigmoid curves can be altered, so can the slope of the radial unit’s Gaussian. See the next illustration below.

[Neural Network Example]

MLP units are defined by their weights and threshold, which together give the equation of the defining line, and the rate of fall-off of the function from that line. Before application of the sigmoid activation function, the activation level of the unit is determined using a weighted sum, which mathematically is the dot product of the input vector and the weight vector of the unit; these units are therefore referred to as dot product units. In contrast, a radial unit is defined by its center point and a radius. A point in N dimensional space is defined using N numbers, which exactly corresponds to the number of weights in a dot product unit, so the center of a radial unit is stored as weights. The radius (or deviation) value is stored as the threshold. It is worth emphasizing that the weights and thresholds in a radial unit are actually entirely different to those in a dot product unit, and the terminology is dangerous if you don’t remember this: Radial weights really form a point, and a radial threshold is really a deviation.

A radial basis function network (RBF), therefore, has a hidden layer of radial units, each actually modeling a Gaussian response surface. Since these functions are nonlinear, it is not actually necessary to have more than one hidden layer to model any shape of function: sufficient radial units will always be enough to model any function. The remaining question is how to combine the hidden radial unit outputs into the network outputs? It turns out to be quite sufficient to use a linear combination of these outputs (i.e., a weighted sum of the Gaussians) to model any nonlinear function. The standard RBF has an output layer containing dot product units with identity activation function (see Haykin, 1994; Bishop, 1995).

RBF networks have a number of advantages over MLPs. First, as previously stated, they can model any nonlinear function using a single hidden layer, which removes some design-decisions about numbers of layers. Second, the simple linear transformation in the output layer can be optimized fully using traditional linear modeling techniques, which are fast and do not suffer from problems such as local minima which plague MLP training techniques. RBF networks can therefore be trained extremely quickly (i.e., orders of magnitude faster than MLPs).

On the other hand, before linear optimization can be applied to the output layer of an RBF network, the number of radial units must be decided, and then their centers and deviations must be set. Although faster than MLP training, the algorithms to do this are equally prone to discover sub-optimal combinations. Other features that distinguish RBF performance from MLPs are due to the differing approaches to modeling space, with RBFs “clumpy” and MLPs “planey.”

Other features which distinguish RBF performance from MLPs are due to the differing approaches to modeling space, with RBFs “clumpy” and MLPs “planey.”

Experience indicates that the RBF’s more eccentric response surface requires a lot more units to adequately model most functions. Of course, it is always possible to draw shapes that are most easily represented one way or the other, but the balance does not favor RBFs. Consequently, an RBF solution will tend to be slower to execute and more space consuming than the corresponding MLP (but it was much faster to train, which is sometimes more of a constraint).

The clumpy approach also implies that RBFs are not inclined to extrapolate beyond known data: the response drops off rapidly towards zero if data points far from the training data are used. Often the RBF output layer optimization will have set a bias level, hopefully more or less equal to the mean output level, so in fact the extrapolated output is the observed mean – a reasonable working assumption. In contrast, an MLP becomes more certain in its response when far-flung data is used. Whether this is an advantage or disadvantage depends largely on the application, but on the whole the MLP’s uncritical extrapolation is regarded as a bad point: extrapolation far from training data is usually dangerous and unjustified.

RBFs are also more sensitive to the curse of dimensionality, and have greater difficulties if the number of input units is large: this problem is discussed further in a later section.

As mentioned earlier, training of RBFs takes place in distinct stages. First, the centers and deviations of the radial units must be set; then the linear output layer is optimized.

Once centers and deviations have been set, the output layer can be optimized using the standard linear optimization technique: the pseudo-inverse (singular value decomposition) algorithm (Haykin, 1994; Golub and Kahan, 1965).

However, RBFs as described above suffer similar problems to Multilayer Perceptrons if they are used for classification – the output of the network is a measure of distance from a decision hyperplane, rather than a probabilistic confidence level. We may therefore choose to modify the RBF by including an output layer with logistic or softmax (normalized exponential) outputs, which is capable of probability estimation. We lose the advantage of fast linear optimization of the output layer; however, the non-linear output layer still has a relatively well-behaved error surface, and can be optimized quite quickly using a fast iterative algorithm such as conjugate gradient descent.




SOFM Networks

Self Organizing Feature Map (SOFM, or Kohonen) networks are used quite differently to the other networks. Whereas all the other networks are designed for supervised learning tasks, SOFM networks are designed primarily for unsupervised learning (see Kohonen, 1982; Haykin, 1994; Patterson, 1996; Fausett, 1994).

Whereas in supervised learning the training data set contains cases featuring input variables together with the associated outputs (and the network must infer a mapping from the inputs to the outputs), in unsupervised learning the training data set contains only input variables.

At first glance this may seem strange. Without outputs, what can the network learn? The answer is that the SOFM network attempts to learn the structure of the data.

One possible use is therefore in exploratory data analysis. The SOFM network can learn to recognize clusters of data, and can also relate similar classes to each other. The user can build up an understanding of the data, which is used to refine the network. As classes of data are recognized, they can be labeled, so that the network becomes capable of classification tasks. SOFM networks can also be used for classification when output classes are immediately available – the advantage in this case is their ability to highlight similarities between classes.

A second possible use is in novelty detection. SOFM networks can learn to recognize clusters in the training data, and respond to it. If new data, unlike previous cases, is encountered, the network fails to recognize it and this indicates novelty.

A SOFM network has only two layers: the input layer, and an output layer of radial units (also known as the topological map layer). The units in the topological map layer are laid out in space – typically in two dimensions (although ST Neural Networks also supports one-dimensional Kohonen networks).

SOFM networks are trained using an iterative algorithm. Starting with an initially-random set of radial centers, the algorithm gradually adjusts them to reflect the clustering of the training data.

The iterative training procedure also arranges the network so that units representing centers close together in the input space are also situated close together on the topological map. You can think of the network’s topological layer as a crude two-dimensional grid, which must be folded and distorted into the N-dimensional input space, so as to preserve as far as possible the original structure. Clearly any attempt to represent an N-dimensional space in two dimensions will result in loss of detail; however, the technique can be worthwhile in allowing the user to visualize data which might otherwise be impossible to understand.

The basic iterative Kohonen algorithm simply runs through a number of epochs, on each epoch executing each training case and applying the following algorithm:

  • Select the winning neuron (the one who’s center is nearest to the input case);
  • Adjust the winning neuron to be more like the input case (a weighted sum of the old neuron center and the training case).

The algorithm uses a time-decaying learning rate, which is used to perform the weighted sum and ensures that the alterations become more subtle as the epochs pass. This ensures that the centers settle down to a compromise representation of the cases which cause that neuron to win.

The topological ordering property is achieved by adding the concept of a neighborhood to the algorithm. The neighborhood is a set of neurons surrounding the winning neuron. The neighborhood, like the learning rate, decays over time, so that initially quite a large number of neurons belong to the neighborhood (perhaps almost the entire topological map); in the latter stages the neighborhood will be zero (i.e., consists solely of the winning neuron itself). In the Kohonen algorithm, the adjustment of neurons is actually applied not just to the winning neuron, but to all the members of the current neighborhood.

The effect of this neighborhood update is that initially quite large areas of the network are “dragged towards” training cases – and dragged quite substantially. The network develops a crude topological ordering, with similar cases activating clumps of neurons in the topological map. As epochs pass the learning rate and neighborhood both decrease, so that finer distinctions within areas of the map can be drawn, ultimately resulting in fine-tuning of individual neurons. Often, training is deliberately conducted in two distinct phases: a relatively short phase with high learning rates and neighborhood, and a long phase with low learning rate and zero or near-zero neighborhood.

Once the network has been trained to recognize structure in the data, it can be used as a visualization tool to examine the data. The Win Frequencies (counts of the number of times each neuron wins when training cases are executed) can be examined to see if distinct clusters have formed on the map. Individual cases are executed and the topological map observed, to see if some meaning can be assigned to the clusters (this usually involves referring back to the original application area, so that the relationship between clustered cases can be established). Once clusters are identified, neurons in the topological map are labeled to indicate their meaning (sometimes individual cases may be labeled, too). Once the topological map has been built up in this way, new cases can be submitted to the network. If the winning neuron has been labeled with a class name, the network can perform classification. If not, the network is regarded as undecided. Datasheet

SOFM networks are inspired by some known properties of the brain. The cerebral cortex is actually a large flat sheet (about 0.5m squared; it is folded up into the familiar convoluted shape only for convenience in fitting into the skull!) with known topological properties (for example, the area corresponding to the hand is next to the arm, and a distorted human frame can be topologically mapped out in two dimensions on its surface).


Classification in Neural Networks

In classification problems, the purpose of the network is to assign each case to one of a number of classes (or, more generally, to estimate the probability of membership of the case in each class). Nominal output variables are used to indicate a classification problem. The nominal values correspond to the various classes.

Nominal variables are normally represented in networks using one of two techniques, the first of which is only available for two-state variables; these techniques are: two-state, one-of-N. In two-state representation, a single node corresponds to the variable, and a value of 0.0 is interpreted as one state, and a value of 1.0 as the other. In one-of-N encoding, one unit is allocated for each state, with a particular state represented by 1.0 on that particular unit, and 0.0 on the others.

Input nominal variables are easily converted using the above methods, both during training and during execution. Target outputs for units corresponding to nominal variables are also easily determined during training. However, more effort is required to determine the output class assigned by a network during execution.

The output units each have continuous activation values between 0.0 and 1.0. In order to definitely assign a class from the outputs, the network must decide if the outputs are reasonably close to 0.0 and 1.0. If they are not, the class is regarded as undecided.

In an SOFM network, the winning node in the topological map (output) layer is the one with the lowest activation level (which measures the distance of the input case from the point stored by the unit). If the distance is small enough, the case is assigned to the class.

Classification Statistics

When selecting accept/reject thresholds, and assessing the classification ability of the network, the most important indicator is the classification summary spreadsheet. This shows how many cases were correctly classified, incorrectly classified, or unclassified. You can also use the confusion matrix spreadsheet to break down how many cases belonging to each class were assigned to another class. All these figures can be independently reported for the training, testing, and validation sets.


Regression Problems in Neural Networks

In regression problems, the objective is to estimate the value of a continuous output variable, given the known input variables. Regression problems are represented by data sets with non-nominal (standard numeric) output(s).

A particularly important issue in regression is output scaling, and extrapolation effects.

The most common neural network architectures have outputs in a limited range (e.g., (0,1) for the logistic activation function). This presents no difficulty for classification problems, where the desired output is in such a range. However, for regression problems there clearly is an issue to be resolved, and some of the consequences are quite subtle.

This subject is discussed below.

As a first pass, we can apply a scaling algorithm to ensure that the network’s output will be in a sensible range. The simplest scaling function is minimax: this finds the minimum and maximum values of a variable in the training data, and performs a linear transformation (using a shift and a scale factor) to convert the values into the target range (typically [0.0,1.0]). If this is used on a continuous output variable, then we can guarantee that all training values will be converted into the range of possible outputs of the network, and so the network can be trained. We also know that the network’s output will be constrained to lie within this range. This may or may not be regarded as a good thing, which brings us to the subject of extrapolation.

[Neural Network Example]

Consider the figure above. Here, we are trying to estimate the value of y from the value of x. A curve has to be fitted that passes through the available data points. We can probably easily agree on the illustrated curve, which is approximately the right shape, and this will allow us to estimate y given inputs in the range represented by the solid line where we can interpolate.

However, what about a point well to the right of the data points? There are two possible approaches to estimating y for this point. First, we might decide to extrapolate: projecting the trend of the fitted curve onwards. Second, we might decide that we don’t really have sufficient evidence to assign any value, and therefore assign the mean output value (which is probably the best estimate we have lacking any other evidence).

Let’s assume that we are using an MLP. Using minimax as suggested above is highly restrictive. First, the curve is not extrapolated, however close to the training data we may be (if we are only a little bit outside the training data, extrapolation may well be justified). Second, it does not estimate the mean either – it actually saturates at either the minimum or maximum, depending on whether the estimated curve was rising or falling as it approached this region.

We can replace the logistic output activation function with a linear activation function, which simply passes on the activation level unchanged (N.B. only the activation functions in the output layer are changed; the hidden layers still use logistic or hyperbolic activation functions). The linear activation function does not saturate, and so can extrapolate further (the network will still saturate eventually as the hidden units saturate). A linear activation function in an MLP can cause some numerical difficulties for the back propagation algorithm, however, and if this is used a low learning rate (below 0.1) must be used. This approach may be appropriate if you want to extrapolate.

The above discussion focused on the performance of MLPs in regression, and particularly their behavior with respect to extrapolation. Networks using radial units (RBFs) perform quite differently, and need different treatment.

Radial networks are inherently incapable of extrapolation. As the input case gets further from the points stored in the radial units, so the activation of the radial units decays and (ultimately) the output of the network decays. An input case located far from the radial centers will generate a zero output from all hidden units. The tendency not to extrapolate can be regarded as good (depending on your problem-domain and viewpoint), but the tendency to decay to a zero output (at first sight) is not. If we decide to eschew extrapolation, then what we would like to see reported at highly novel input points is the mean. In fact, the RBF has a bias value on the output layer, and sets this to a convenient value, which hopefully approximates the sample mean. Then, the RBF will always output the mean if asked to extrapolate.

The performance of a regression network can be examined in a number of ways.

  1. The output of the network for each case (or any new case you choose to test) can be submitted to the network. If part of the data set, the residual errors can also be generated.
  2. Summary statistics can be generated. These include the mean and standard deviation of both the training data values and the prediction error. We would generally expect to see a prediction error mean extremely close to zero (it is, after all, possible to get a zero prediction error mean simply by estimating the mean training data value, without any recourse to the input variables or a neural network at all).
    The regression statistics also include the Pearson-R correlation coefficient between the network’s prediction and the observed values. In linear modeling, the Pearson-R correlation between the predictor variable and the predicted is often used to express correlation – if a linear model is fitted, this is identical to the correlation between the model’s prediction and the observed values (or, to the negative of it). Thus, this gives you a convenient way to compare the neural network’s accuracy with that of your linear models.

Time Series Prediction in Neural Networks

In time series problems, the objective is to predict the value of a variable that varies in time, using previous values of that and/or other variables (see Bishop, 1995)

Typically, the predicted variable is continuous, so time series prediction is usually a specialized form of regression. However, without this restriction, time series can also perform prediction of nominal variables (i.e., classification).

It is also usual to predict the next value in a series from a fixed number of previous values (looking ahead a single time step). When the next value in a series is generated, further values can be estimated by feeding the newly-estimated value back into the network together with other previous values: time series projection. Obviously, the reliability of projection drops the more steps ahead we try to predict, and if a particular distance ahead is required, it is probably better to train a network specifically for that degree of lookahead.

Any type of network can be used for time series prediction (the network type must, however, be appropriate for regression or classification, depending on the problem type). The network can also have any number of input and output variables. However, most commonly there is a single variable that is both the input and (with the lookahead taken into account) the output. Configuring a network for time series usage alters the way that data is pre-processed (i.e., it is drawn from a number of sequential cases, rather than a single case), but the network is executed and trained just as for any other problem.

The time series training data set therefore typically has a single variable, and this has type input/output (i.e., it is used both for network input and network output).

The most difficult concept in time series handling is the interpretation of training, testing, validation, and ignored cases. For standard data sets, each case is independent, and these meanings are clear. However, with a time series network, each pattern of inputs and outputs is actually drawn from a number of cases, determined by the network’s Steps and Lookahead parameters. There are two consequences of this:

The input pattern’s type is taken from the type of the output case. For example, in a data set, the first two cases are ignored and the third is test, with Steps=2 and Lookahead=1, the first usable pattern has type Test, and draws its inputs from the first two cases, and its output from the third. Thus, the first two cases are used in the test set even though they are marked Ignore. Further, any given case may be used in three patterns, and these may be any of training, testing, and validation patterns. In some sense, data actually leaks between training, testing, and validation sets. To isolate the three sets entirely, contiguous blocks of train, verify, or test cases would need to be constructed, separated by the appropriate number of ignore cases.

The first few cases can only be used as inputs for patterns. When selecting cases for time series use, the case number selected is always the output case. The first few clearly cannot be selected (as this would require further cases before the beginning of the data set), and are not available.


Variable Selection and Dimensionality Reduction

The preceding sections on network design and training have all assumed that the input and output layers are fixed; that is, that we know what variables will be input to the network, and what output is expected. The latter is always (at least, for supervised learning problems) known. However, the selection of inputs is far more difficult (see Bishop, 1995). Often, we do not know which of a set of candidate input variables are actually useful, and the selection of a good set of inputs is complicated by a number of important considerations:

Curse of dimensionality. Each additional input unit in a network adds another dimension to the space in which the data cases reside. We are attempting to fit a response surface to this data. Thought of in this way, there must be sufficient data points to populate an N dimensional space sufficiently densely to be able to see the structure. The number of points needed to do this properly grows very rapidly with the dimensionality (roughly, in proportion to 2N for most modelling techniques). Most forms of neural network (in particular, MLPs) actually suffer less from the curse of dimensionality than some other methods, as they can concentrate on a lower-dimensional section of the high-dimensional space (for example, by setting the outgoing weights from a particular input to zero, an MLP can entirely ignore that input). Nevertheless, the curse of dimensionality is still a problem, and the performance of a network can certainly be improved by eliminating unnecessary input variables. Indeed, even input variables that carry a small amount of information may sometimes be better eliminated if this reduces the curse of dimensionality.

Inter-dependency of variables. It would be extremely useful if each candidate input variable could be independently assessed for usefulness, so that the most useful ones could be extracted. Unfortunately, it is seldom possible to do this, and two or more interdependent variables may together carry significant information that a subset would not. A classic example is the two-spirals problem, where two classes of data are laid out in an interlocking spiral pattern in two dimensions. Either variable alone carries no useful information (the two classes appear wholly intermixed), but with the two variables together the two classes can be perfectly distinguished. Thus, variables cannot, in general, be independently selected.

Redundancy of variables. Often a number of variables can carry to some extent or other the same information. For example, the height and weight of people might in many circumstances carry similar information, as these two variables are correlated. It may be sufficient to use as inputs some subset of the correlated variables, and the choice of subset may be arbitrary. The superiority of a subset of correlated variables over the full set is a consequence of the curse of dimensionality.

Selection of input variables is therefore a critical part of neural network design. You can use a combination of your own expert knowledge of the problem domain, and standard statistical tests to make some selection of variables before starting to use Neural Networks. Once you begin using Neural Networks, various combinations of inputs can be tried. You can experimentally add and remove various combinations, building new networks for each. You can also conduct Sensitivity Analysis, which rates the importance of variable with respect to a particular model.

Another approach to dealing with dimensionality problems, which may be an alternative or a complement to variable selection, is dimensionality reduction. In dimensionality reduction, the original set of variables is processed to produce a new and smaller set of variables that contains (we hope) as much information as possible from the original set. As an example, consider a data set where all the points lie on a plane in a three dimensional space. The intrinsic dimensionality of the data is said to be two (as all the information actually resides in a two-dimensional sub-space). If this plane can be discovered, the neural network can be presented with a lower dimensionality input, and stands a better chance of working correctly.


Ensembles and Resampling

We have already discussed the problem of over-learning, which can compromise the ability of neural networks to generalize successfully to new data. An important approach to improve performance is to form ensembles of neural networks. The member networks’ predictions are averaged (or combined by voting) to form the ensemble’s prediction. Frequently, ensemble formation is combined with resampling of the data set. This approach can significantly improve generalization performance. Resampling can also be useful for improved estimation of network generalization performance.

To explain why resampling and ensembles are so useful, it is helpful to formulate the neural network training process in statistical terms (Bishop, 1995). We regard the problem as that of estimating an unknown nonlinear function, which has additive noise, on the basis of a limited data set of examples, D. There are several sources of error in our neural network’s predictions. First, and unavoidably, even a “perfect” network that exactly modeled the underlying function would make errors due to the noise. However, there is also error due to the fact that we need to fit the neural network model using the finite sample data set, D. This remaining error can be split into two components, the model bias and variance. The bias is the average error that a particular model training procedure will make across different particular data sets (drawn from the unknown function’s distribution). The variance reflects the sensitivity of the modeling procedure to a particular choice of data set.

We can trade off bias versus variance. At one extreme, we can arbitrarily select a function that entirely ignores the data. This has zero variance, but presumably high bias, since we have not actually taken into account the known aspects of the problem at all. At the opposite extreme, we can choose a highly complex function that can fit every point in a particular data set, and thus has zero bias, but high variance as this complex function changes shape radically to reflect the exact points in a given data set. The high bias, low variance solutions can have low complexity (e.g., linear models), whereas the low bias, high variance solutions have high complexity. In neural networks, the low complexity models have smaller numbers of units.

How does this relate to ensembles and resampling? We necessarily divide the data set into subsets for training, testing, and validation. Intuitively, this is a shame, as not all the data gets used for training. If we resample, using a different split of data each time, we can build multiple neural networks, and all the data gets used for training at least some of them. If we then form the networks into an ensemble, and average the predictions, an extremely useful result occurs. Averaging across the models reduces the variance, without increasing the bias. Arguably, we can afford to build higher bias models than we would otherwise tolerate (i.e., higher complexity models), on the basis that ensemble averaging can then mitigate the resulting variance.

The generalization performance of an ensemble can be better than that of the best member network, although this does depend on how good the other networks in the ensemble are. Unfortunately, it is not possible to show whether this is actually the case for a given ensemble. However, there are some reassuring pieces of theory to back up the use of ensembles.

First, it can be shown (Bishop, 1995) that, on the assumption that the ensemble members’ errors have zero mean and are uncorrelated, the ensemble reduces the error by a factor of N, where N is the number of members. In practice, of course, these errors are not uncorrelated. An important corollary is that an ensemble is more effective when the members are less correlated, and we might intuitively expect that to be the case if diverse network types and structures are used.

Second, and perhaps more significantly, it can be shown that the expected error of the ensemble is at least as good as the average expected error of the members, and usually better. Typically, some useful reduction in error does occur. There is of course a cost in processing speed, but for many applications this is not particularly problematic.

There are a number of approaches to resampling available.

The simplest approach is random (monte carlo) resampling, where the training, testing, and validation sets are simply drawn at random from the data set, keeping the sizes of the subsets constant. Alternatively, you CAN sometimes resample the training and test set, but keep the validation set the same, to support a simple direct comparison of results.

The second approach is bootstrap sampling. In the bootstrap, a new training set is formed by sampling with replacement from the available data set. In sampling with replacement, cases are drawn at random from the data set, with equal probability, and any one case may be selected any number of times. Typically the bootstrap set has the same number of cases as the data set, although this is not a necessity. Due to the sampling process, it is likely that some of the original cases will not be selected, and these can be used to form a test set, whereas other cases will have been duplicated.

The bootstrap procedure replicates, insofar as is possible with limited data, the idea of drawing multiple data sets from the original distribution. Once again, the effect can be to generate a number of models with low bias, and to average out the variance. Ensembles can also be beneficial at averaging out bias. If we include different network types and configurations in an ensemble, it may be that different networks make systematic errors in different parts of the input space. Averaging these differently configured networks may iron out some of this bias.

Considering Alternatives to SAS?

Do you use SAS for predictive modeling, advanced analytics, business intelligence, insurance or financial applications, or data visualization?


SAS software is expensive and carries high, unpredictable annual licensing costs. SAS software is difficult to use, requiring specific SAS programming expertise, and it drives users toward dependency on only SAS-specific solutions (e.g., their proprietary data warehouses). Data visualization is integral for analytics, but SAS’s graphics have major shortcomings.

STATISTICA has consistently been ranked the highest in ease of use and customer satisfaction in independent surveys of analytics professionals. Click here to see the results of the most recent Rexer survey (2010), the largest survey of data mining professionals in the industry.


We offer the breadth of analytics capabilities and performance, including the most comprehensive data mining solution on the market, using more open, modern technologies. StatSoft software is designed to facilitate interfacing with all industry standard components of your computer infrastructure (e.g., ultra-fast integration with Oracle, MS SQL Server, and other databases) instead of locking you into proprietary standards and total dependence on one vendor.

STATISTICA is significantly faster than SAS. StatSoft is an Intel® Software Premiere Elite Partner and has developed technologies that leverage Intel CPU architecture to deliver unmatched parallel processing performance (press release with Intel) and rapidly process terabytes of data. StatSoft’s robust, cutting-edge enterprise system technology drives the analytics and analytic data management at some of the largest computer infrastructures in the world at Fortune 100 and Fortune 500 companies.

Quotes from SAS Customers

“We acquired our SAS license seven years ago and quickly learned that with SAS, you do not pay just an annual renewal and support fee – you practically have to “buy” the software again every year. Our first year renewal fee was already 60% of the initial purchase price, and it increased steadily and every year. Two years ago, our annual fee exceeded the initial purchase price we paid, and it keeps going up much faster than the inflation. This is not sustainable.” – CEO, Technology Company

“It took 8 weeks to install SAS Enterprise Miner. The installer just didn’t work. And we’re a midsize company, so we were a low priority for SAS’s technical support.” – Engineer, Chemical Company

“Early in our evaluation, we eliminated SAS from our consideration of fraud detection solutions primarily due to the exorbitant cost.” – Chief Actuary, Insurance Company

“We had used SAS on-demand for my data mining class. A few days before finals, all of our students’ project files were corrupted. Our SAS technical support representative confirmed there was nothing that could be done to restore the files. We’re switching to STATISTICA.” – University Professor

“Now, all graduate students use R. It is getting more difficult to find SAS programmers.” – Head of Statistics, Pharmaceutical Company

“We used SAS until May 2009 when we converted to WPS. The conversion went remarkably smoothly and was completed on time. Not only did we save a substantial amount in licensing fees, we also regained functionality such as Graphs that we had previously removed because of the cost.” – Survey respondent on
How to Proceed

StatSoft makes it easy to transition your current SAS environment to STATISTICA, either gradually or all at once. STATISTICA offers:

Direct import/export to SAS files
Deployment of predictive models to SAS code to score against SAS data sets
Native integration to run R program

For more information and for specific recommendations to suit your needs, please contact one of our representatives using the form below: ,