# Monthly Archives: September 2012

## Advanced Process Monitoring Solutions for the Automotive Manufacturing Industry

Automotive manufacturers, including suppliers to the automotive industry, benefit from a multitude of *STATISTICA* products to achieve the most efficient processes in the business. Typical applications include monitoring processes, finding important controllable factors and anticipating issues before they occur. *STATISTICA *solutions available for these tasks include: *STATISTICA Enterprise QC, STATISTICA Monitoring and Alerting Server (MAS), WebSTATISTICA Enterprise QC, *and * STATISTICA Process Optimization and Root Cause Analysis.*

## Areas of Application: Monitoring Processes with *STATISTICA Enterprise QC and MAS*

*STATISTICA Enterprise QC* monitors the various critical manufacturing processes that are taking place simultaneously at the facility during testing and assembly. Immediately knowing when a process gets off spec saves time and materials. *STATISTICA Enterprise QC* offers SPC solutions for automotive suppliers to monitor processes and part testing to ensure quality of parts and assemblies.

*STATISTICA Monitoring and Alerting Server (MAS)*provides automated monitoring and dashboard summaries for highly automated automotive manufacturing and assembly processes.

## Collaborating with Suppliers using *WebSTATISTICA Enterprise QC*

*WebSTATISTICA Enterprise QC* enables automotive manufacturers to collaborate with suppliers through its web interface. This allows for the sharing of supplier data and collaborative review of results.

## Anticipating Issues before they Occur with *STATISTICA Process Optimization and Root Cause Analysis*

*STATISTICA Process Optimization and Root Cause Analysis* is an exceptional tool for monitoring the manufacturing process at each step along the way, even anticipating quality control problems with unmatched sensitivity and effectiveness. By integrating cutting-edge predictive modeling and data mining techniques with the vast array of traditional quality tools including quality control charting, process capability analysis, experimental design procedures and Six Sigma methods, *STATISTICA Process Optimization and Root Cause Analysis* allows for complete process understanding, root cause analysis, and accurate predictions of quality outcomes during the manufacturing process.

*STATISTICA Process Optimization and Root Cause Analysis* allows you to take advantage of existing historical data and find patterns in the data that affect the final outcome. As most automated manufacturing processes involve a large number of steps to get to the end product and interactions between these effects often exist, a traditional experimental design would require far too many runs. Root Cause analysis uses your historical data to find factors and combinations of factors that affect the end product quality.

*STATISTICA Process Optimization and Root Cause Analysis* builds predictive models that reflect the relationship between manufacturing inputs and outcomes (e.g., conformance to specifications) of the manufacturing process. The models can then be used to simulate runs, finding optimal settings and improving overall quality of the process.

For an overview of the application of predictive modeling to manufacturing processes, read the article from Quality Digest, Finding Direction in Chaos, Data mining methods make sense out of millions of seemingly random data points

## South African Police Service Crime Stats Report 2012

SAPS (South African Police Service) released their crime statistics report, presentation shows 3.1% drop in violent crime.

For full report, please click here

## STATISTICA Solutions using the R Language Platform

R is a programming language and environment for statistical computing. The R platform and its source code are freely available under the GNU GPL license (see http://cran.r-project.org).

## Overview

You want to use some R features, however the R platform doesn’t entirely meet your needs. ** STATISTICA Integration with R** has solutions to the following concerns:

- R tabular results are cumbersome to manipulate; R graphs cannot be modified; output is difficult to manage.

- I want to combine an R function with custom programming.

- I need a specialized R data mining algorithm that is not available in any commercial software package.

- My company needs to run reports based on R scripts, but my users don’t know how to use R.

- My company needs to run a high volume of data-intensive analyses using R algorithms, but the R program is too slow.

For additional information, see the following white papers:

## Sample Applications

### R tabular results are cumbersome to manipulate; R graphs cannot be modified; output is difficult to manage.

With all *STATISTICA* products, including *STATISTICA* (desktop), you can:

- Run R programs (or “scripts”) to produce output as
*STATISTICA*spreadsheets and graphs, which can be managed in*STATISTICA*workbooks and/or saved in*STATISTICA*reports.

### I want to combine an R function with custom programming.

With all *STATISTICA* products, including *STATISTICA* (desktop), you can:

- Build functions that are entirely (or partially) based on R by using
*STATISTICA*Visual Basic (SVB).

### I need a specialized R data mining algorithm that is not available in any commercial software package.

With *STATISTICA Data Miner*, you can:

- Create a data miner workspace where you can build and maintain models with R-based nodes.

### My company needs to run reports based on R scripts, but my users don’t know how to use R.

With *STATISTICA Enterprise*, you can:

- Generate automated reports from reusable R-based analysis configurations, which deliver the power of R to users not familiar with R.

### My company needs to run a high volume of data-intensive analyses using R algorithms, but the R program is too slow.

With *WebSTATISTICA Server*, you can:

- Off-load R scripts (as well as SVB scripts, Data Miner workspaces, etc.), creating a powerful multi-processor multi-user R server with load balancing, batch-job capabilities (scheduling), and more.

## Big Data, How to Detect Relationships Between Categorical Variables

## Association Rules Introductory Overview

The goal of the techniques described in this topic is to detect relationships or associations between specific values of categorical variables in large data sets. This is a common task in many data mining projects as well as in the data mining subcategory text mining. These powerful exploratory techniques have a wide range of applications in many areas of business practice and also research – from the analysis of consumer preferences or human resource management, to the history of language. These techniques enable analysts and researchers to uncover hidden patterns in large data sets, such as “customers who order product *A* often also order product *B* or *C*” or “employees who said positive things about initiative *X* also frequently complain about issue *Y* but are happy with issue *Z*.” The implementation of the so-called *a-priori* algorithm (see Agrawal and Swami, 1993; Agrawal and Srikant, 1994; Han and Lakshmanan, 2001; see also Witten and Frank, 2000) allows us to process rapidly huge data sets for such associations, based on predefined “threshold” values for detection.

**How association rules work. **The usefulness of this technique to address unique data mining problems is best illustrated in a simple example. Suppose we are collecting data at the check-out cash registers at a large book store. Each customer transaction is logged in a database, and consists of the titles of the books purchased by the respective customer, perhaps additional magazine titles and other gift items that were purchased, and so on. Hence, each record in the database will represent one customer (transaction), and may consist of a single book purchased by that customer, or it may consist of many (perhaps hundreds of) different items that were purchased, arranged in an arbitrary order depending on the order in which the different items (books, magazines, and so on) came down the conveyor belt at the cash register. The purpose of the analysis is to find associations between the items that were purchased, i.e., to derive association rules that identify the items and co-occurrences of different items that appear with the greatest (co-)frequencies. For example, we want to learn which books are likely to be purchased by a customer who we know already purchased (or is about to purchase) a particular book. This type of information could then quickly be used to suggest to the customer those additional titles. You may already be “familiar” with the results of these types of analyses if you are a customer of various on-line (Web-based) retail businesses; many times when making a purchase on-line, the vendor will suggest similar items (to the ones purchased by you) at the time of “check-out”, based on some rules such as “customers who buy book title *A* are also likely to purchase book title *B*,” and so on.

**Sequence Analysis. **Sequence analysis is concerned with a subsequent purchase of a product or products given a previous buy. For instance, buying an extended warranty is more likely to follow (in that specific sequential order) the purchase of a TV or other electric appliances. Sequence rules, however, are not always that obvious, and sequence analysis helps you to extract such rules no matter how hidden they may be in your market basket data. There is a wide range of applications for sequence analysis in many areas of industry including customer shopping patterns, phone call patterns, the fluctuation of the stock market, DNA sequence, and Web log streams.

**Link Analysis. **Once extracted, rules about associations or the sequences of items as they occur in a transaction database can be extremely useful for numerous applications. Obviously, in retailing or marketing, knowledge of purchase “patterns” can help with the direct marketing of special offers to the “right” or “ready” customers (i.e., those who, according to the rules, are most likely to purchase specific items given their observed past consumption patterns). However, transaction databases occur in many areas of business, such as banking. In fact, the term “link analysis” is often used when these techniques – for extracting sequential or non-sequential association rules – are applied to organize complex “evidence.” It is easy to see how the “transactions” or “shopping basket” metaphor can be applied to situations where individuals engage in certain actions, open accounts, contact other specific individuals, and so on. Applying the technologies described here to such databases may quickly extract patterns and associations between individuals and actions and, hence, for example, reveal the patterns and structure of some clandestine illegal network.

**Unique data analysis requirements. **Crosstabulation tables, and in particular Multiple Response tables can be used to analyze data of this kind. However, in cases when the number of different items (categories) in the data is very large (and not known ahead of time), and when the “factorial degree” of important association rules is not known ahead of time, then these tabulation facilities may be too cumbersome to use, or simply not applicable: Consider once more the simple “bookstore-example” discussed earlier. First, the number of book titles is practically unlimited. In other words, if we would make a table where each book title would represent one dimension, and the purchase of that book (yes/no) would be the classes or categories for each dimension, then the complete crosstabulation table would be huge and sparse (consisting mostly of empty cells). Alternatively, we could construct all possible two-way tables from all items available in the store; this would allow us to detect two-way associations (association rules) between items. However, the number of tables that would have to be constructed would again be huge, most of the two-way tables would be sparse, and worse, if there were any three-way association rules “hiding” in the data, we would miss them completely. The *a-priori* algorithm implemented in *Association Rules* will not only automatically detect the relationships (“cross-tabulation tables”) that are important (i.e., cross-tabulation tables that are not sparse, not containing mostly zero’s), but also determine the factorial degree of the tables that contain the important association rules.

To summarize, use *Association Rules* to find rules of the kind *If X then (likely) Y *where *X* and *Y* can be single values, items, words, etc., or conjunctions of values, items, words, etc. (e.g., *if (Car=Porsche and Gender=Male and Age<20) then (Risk=High and Insurance=High)*). The program can be used to analyze simple categorical variables, dichotomous variables, and/or multiple response variables. The algorithm will determine association rules without requiring the user to specify the number of distinct categories present in the data, or any prior knowledge regarding the maximum factorial degree or complexity of the important associations. In a sense, the algorithm will construct cross-tabulation tables without the need to specify the number of dimensions for the tables, or the number of categories for each dimension. Hence, this technique is particularly well suited for data and text mining of huge databases.

### Computational Procedures and Terminology

**Categorical or class variables.** Categorical variables are single variables that contains codes or text values to denote distinct classes; for example, a variable *Gender* would have the categories *Male* and *Female*.

**Multiple response variables.** Multiple response variables usually consist of multiple variables (i.e., a list of variables) that can contain, for each observations, codes or text values describing a single “dimension” or transaction. A good example of a multiple response variable would be if a vendor recorded the purchases made by a customer in a single record, where each record could contain one or more items purchased, in arbitrary order. This is a typical format in which customer transaction data would be kept.

**Multiple dichotomies.** In this data format, each variable would represent one item or category, and the dichotomous data in each variable would indicate whether or not the respective item or category applies to the respective case. For example, suppose a vendor created a data spreadsheet where each column represented one of the products available for purchase. Each transaction (row of the data spreadsheet) would record whether or not the respective customer did or did not purchase that product, i.e., whether or not the respective transaction involved each item.

**Association Rules: If Body then Head.** The *a-priori* algorithm attempts to derive from the data association rules of the form: *If “Body” then “Head”*, where *Body* and *Head* stand for simple codes or text values (items), or the conjunction of codes and text values (items; e.g., *if (Car=Porsche and Age<20) then (Risk=High and Insurance=High)*; here the logical conjunction before the then would be the *Body*, and the logical conjunction following the then would be the *Head* of the association rule).

**Initial Pass Through the Data: The Support Value.** First the program will scan all variables to determine the unique codes or text values (items) found in the variables selected for the analysis. In this initial pass, the relative frequencies with which the individual codes or text values occur in each transaction will also be computed. The probability that a transaction contains a particular code or text value is called *Support*; the *Support* value is also computed in consecutive passes through the data, as the joint probability (relative frequency of co-occurrence) of pairs, triplets, etc. of codes or text values (items), i.e., separately for the *Body* and *Head* of each association rule.

**Second Pass Through the Data: The Confidence Value; Correlation Value. **After the initial pass through the data, all items with a support value less than some predefined minimum support value will be “remembered” for subsequent passes through the data: Specifically, the conditional probabilities will be computed for all pairs of codes or text values that have support values greater than the minimum support value. This conditional probability – that an observation (transaction) that contains a code or text value *X* also contains a code or text value *Y* – is called the *Confidence Value*. In general (in later passes through the data) the confidence value denotes the conditional probability of the *Head* of the association rule, given the *Body* of the association rule.

In addition, the support value will be computed for each pair of codes or text values, and a *Correlation* value based on the support values. The correlation value for a pair of codes or text values {*X, Y*} is computed as the support value for that pair, divided by the square root of the product of the support values for *X* and *Y*. After the second pass through the data those pairs of codes or text values that (1) have a confidence value that is greater than some user-defined minimum confidence value, (2) have a support value that is greater than some user-defined minimum support value, and (3) have a correlation value that is greater than some minimum correlation value will be retained.

**Subsequent Passes Through The Data: Maximum Item Size in Body, Head. **The data in subsequent steps, the data will be further scanned computing support, confidence, and correlation values for pairs of codes or text values (associations between single codes or text values), triplets of codes or text values, and so on. To reiterate, in general, at each association rules will be derived of the general form if “*Body*” then “*Head*“, where *Body* and *Head* stand for simple codes or text values (items), or the conjunction of codes and text values (items).

Unless the process stops because no further associations can be found that satisfy the minimum support, confidence, and correlation conditions, the process could continue to build very complex association rules (e.g., *if X1 and X2 .. and X20 then Y1 and Y2 … and Y20*). To avoid excessive complexity, additionally, the user can specify the maximum number of codes or text values (items) in the *Body* and *Head* of the association rules; this value is referred to as the maximum item set size in the *Body* and *Head* of an association rule.

### Tabular Representation of Associations

Association rules are generated of the general form *if Body then Head*, where *Body* and *Head* stand for single codes or text values (items) or conjunctions of codes or text values (items; e.g., *if (Car=Porsche and Age<20) then (Risk=High and Insurance=High*). The major statistics computed for the association rules are *Support* (relative frequency of the *Body* or *Head* of the rule), *Confidence* (conditional probability of the *Head* given the *Body* of the rule), and *Correlation* (support for *Body* and *Head*, divided by the square root of the product of the support for the *Body* and the support for the *Head*). These statistics can be summarized in a spreadsheet, as shown below.

This results spreadsheet shows an example of how association rules can be applied to text mining tasks. This analysis was performed on the paragraphs (dialog spoken by the characters in the play) in the first scene of Shakespeare’s “All’s Well That Ends Well,” after removing a few very frequent words like *is*, *of*, etc. The values for support, confidence, and correlation are expressed in percent.

### Graphical Representation of Associations

As a result of applying Association Rules data mining techniques to large datasets rules of the form *if “Body” then “Head”* will be derived, where *Body* and *Head* stand for simple codes or text values (items), or the conjunction of codes and text values (items; e.g., *if (Car=Porsche and Age<20) then (Risk=High and Insurance=High)*). These rules can be reviewed in textual format or tables, or in graphical format (see below).

**Association Rules Networks, 2D.** For example, consider the data that describe a (fictitious) survey of 100 patrons of sports bars and their preferences for watching various sports on television. This would be an example of simple categorical variables, where each variable represents one sport. For each sport, each respondent indicated how frequently s/he watched the respective type of sport on television. The association rules derived from these data could be summarized as follows:

In this graph, the support values for the *Body* and *Head* portions of each association rule are indicated by the sizes and colors of each. The thickness of each line indicates the confidence value (conditional probability of Head given Body) for the respective association rule; the sizes and colors of the circles in the center, above the *Implies* label, indicate the joint support (for the co-occurrences) of the respective *Body* and *Head* components of the respective association rules. Hence, in this graphical summary, the strongest support value was found for *Swimming=Sometimes*, which was associated *Gymnastic=Sometimes*, *Baseball = Sometimes*, and *Basketball=Sometimes*. Incidentally. Unlike simple frequency and crosstabulation tables, the absolute frequencies with which individual codes or text values (items) occur in the data are often not reflected in the association rules; instead, only those codes or text values (items) are retained that show sufficient values for support, confidence, and correlation, i.e., that co-occur with other codes or text values (items) with sufficient relative (co-)frequency.

The results that can be summarized in 2D Association Rules networks can be relatively simple, or complex, as illustrated in the network shown to the left.

This is an example of how association rules can be applied to text mining tasks. This analysis was performed on the paragraphs (dialog spoken by the characters in the play) in the first scene of Shakespeare’s “All’s Well That Ends Well,” after removing a few very frequent words like *is*, *of*, etc. Of course, the specific words and phrases removed during the data preparation phase of text (or data) mining projects will depend on the purpose of the research.

**Association Rules Networks, 3D.** Association rules can be graphically summarized in 2D Association Networks, as well as 3D Association Networks. Shown below are some (very clear) results from an analysis. Respondents in a survey were asked to list their (up to) 3 favorite fast-foods. The association rules derived from those data are summarized in a 3D Association Network display.

As in the 2D Association Network, the support values for the *Body* and *Head* portions of each association rule are indicated by the sizes and colors of each circle in the 2D. The thickness of each line indicates the confidence value (joint probability) for the respective association rule; the sizes and colors of the “floating” circles plotted against the (vertical) z-axis indicate the joint support (for the co-occurrences) of the respective *Body* and *Head* components of the association rules. The plot position of each circle along the vertical z – axis indicates the respective confidence value. Hence, this particular graphical summary clearly shows two simple rules: Respondents who name *Pizza* as a preferred fast food also mention *Hamburger*, and vice versa.

### Interpreting and Comparing Results

When comparing the results of applying association rules to those from simple frequency or cross-tabulation tables, we may notice that in some cases very high-frequency codes or text values (items) are not part of any association rule. This can sometimes be perplexing.

To illustrate how this pattern of findings can occur, consider this example: Suppose we analyzed data from a survey of insurance rates for different makes of automobiles in America. Simple tabulation would very likely show that many people drive automobiles manufactured by Ford, GM, and Chrysler; however, none of these makes may be associated with particular patterns in insurance rates, i.e., none of these brands may be involved in high-confidence, high-correlation association rules linking them to particular categories of insurance rates. However, when applying association rules methods, automobile makes that occur in the sample with relatively low frequency (e.g., Porsche) may be found to be associated with high insurance rates (allowing us to infer, for example, a rule that *if Car=Porsche then Insurance=High*). If we only reviewed a simple cross-tabulation table (make of car by insurance rate) this high-confidence association rule may well have gone unnoticed.

## Boosting Trees for Regression and Classification

## Boosting Trees for Regression and Classification Introductory Overview

The general computational approach of stochastic gradient boosting is also known by the names TreeNet (TM Salford Systems, Inc.) and MART (TM Jerill, Inc.). Over the past few years, this technique has emerged as one of the most powerful methods for predictive data mining. Some implementations of these powerful algorithms allow them to be used for regression as well as classification problems, with continuous and/or categorical predictors. Detailed technical descriptions of these methods can be found in Friedman (1999a, b) as well as Hastie, Tibshirani, & Friedman (2001).

## Gradient Boosting Trees

The algorithm for Boosting Trees evolved from the application of boosting methods to regression trees. The general idea is to compute a sequence of (very) simple trees, where each successive tree is built for the prediction residuals of the preceding tree. As described in the General Classification and Regression Trees Introductory Overview, this method will build binary trees, i.e., partition the data into two samples at each split node. Now suppose that you were to limit the complexities of the trees to 3 nodes only: a root node and two child nodes, i.e., a single split. Thus, at each step of the boosting (boosting trees algorithm), a simple (best) partitioning of the data is determined, and the deviations of the observed values from the respective means (residuals for each partition) are computed. The next 3-node tree will then be fitted to those residuals, to find another partition that will further reduce the residual (error) variance for the data, given the preceding sequence of trees.

It can be shown that such “additive weighted expansions” of trees can eventually produce an excellent fit of the predicted values to the observed values, even if the specific nature of the relationships between the predictor variables and the dependent variable of interest is very complex (nonlinear in nature). Hence, the method of gradient boosting – fitting a weighted additive expansion of simple trees – represents a very general and powerful machine learning algorithm.

## The Problem of Overfitting; Stochastic Gradient Boosting

One of the major problems of all machine learning algorithms is to “know when to stop,” i.e., how to prevent the learning algorithm to fit esoteric aspects of the training data that are not likely to improve the predictive validity of the respective model. This issue is also known as the problem of overfitting. To reiterate, this is a general problem applicable to most machine learning algorithms used in predictive data mining. A general solution to this problem is to evaluate the quality of the fitted model by predicting observations in a test-sample of data that have not been used before to estimate the respective model(s). In this manner, one hopes to gage the predictive accuracy of the solution, and to detect when overfitting has occurred (or is starting to occur).

A similar approach is for each consecutive simple tree to be built for only a randomly selected subsample of the full data set. In other words, each consecutive tree is built for the prediction residuals (from all preceding trees) of an independently drawn random sample. The introduction of a certain degree of randomness into the analysis in this manner can serve as a powerful safeguard against overfitting (since each consecutive tree is built for a different sample of observations), and yield models (additive weighted expansions of simple trees) that generalize well to new observations, i.e., exhibit good predictive validity. This technique, i.e., performing consecutive boosting computations on independently drawn samples of observations, is knows as stochastic gradient boosting.

Below is a plot of the prediction error function for the training data over successive trees and also an independently sampled testing data set at each stage.

With this graph, you can identify very quickly the point where the model (consisting of a certain number of successive trees) begins to overfit the data. Notice how the prediction error for the training data steadily decreases as more and more additive terms (trees) are added to the model. However, somewhere past 35 trees, the performance for independently sampled testing data actually begins to deteriorate, clearly indicating the point where the model begins to overfit the data.

## Stochastic Gradient Boosting Trees and Classification

So far, the discussion of boosting trees has exclusively focused on regression problems, i.e., on the prediction of a continuous dependent variable. The technique can easily be expanded to handle classification problems as well (this is described in detail in Friedman, 1999a, section 4.6; in particular, see Algorithm 6).

First, different boosting trees are built for (fitted to) each category or class of the categorical dependent variable, after creating a coded variable (vector) of values for each class with the values 1 or 0 to indicate whether or not an observation does or does not belong to the respective class. In successive boosting steps, the algorithm will apply the logistic transformation (see also *Nonlinear Estimation*) to compute the residuals for subsequent boosting steps. To compute the final classification probabilities, the logistic transformation is again applied to the predictions for each 0/1 coded vector (class). This algorithm is described in detail in Friedman (1999a; see also Hastie, Tibshirani, and Freedman, 2001, for a description of this general procedure).

## Large Numbers of Categories

Note that the procedure for applying this method to classification problems requires that separate sequences of (boosted) trees be built for each category or class. Hence, the computational effort generally becomes larger by a multiple of what it takes to solve a simple regression prediction problem (for a single continuous dependent variable). Therefore, it is not prudent to analyze categorical dependent variables (class variables) with more than, approximately, 100 or so classes; past that point, the computations performed may require an unreasonable amount of effort and time. (For example, a problem with 200 boosting steps and 100 categories or classes for the dependent variable would yield 200 * 100 = 20,000 individual trees.)

## Variance Components and Mixed Model ANOVA/ANCOVA

Variance Components and Mixed Model ANOVA/ANCOVA

The *Variance Components and Mixed Model ANOVA/ANCOVA* section describes a comprehensive set of techniques for analyzing research designs that include random effects; however, these techniques are also well suited for analyzing large main effect designs (e.g., designs with more than 200 levels per factor), designs with many factors where the higher order interactions are not of interest, and analyses involving case weights.

There are several sections in this electronic textbook that discuss Analysis of Variance for factorial or specialized designs. For a discussion of these sections and the types of designs for which they are best suited, refer to Methods for Analysis of Variance. Note, however, that General Linear Models describes how to analyze designs with any number and type of between effects and compute ANOVA-based variance component estimates for any effect in a mixed-model analysis.

Experimentation is sometimes mistakenly thought to involve only the manipulation of levels of the independent variables and the observation of subsequent responses on the dependent variables. Independent variables whose levels are determined or set by the experimenter are said to have *fixed effects*. There is a second class of effects, however, that is often of great interest to the researcher; *random effects* are classification effects where the levels of the effects are assumed to be randomly selected from an infinite population of possible levels.

Many independent variables of research interest are not fully amenable to experimental manipulation, but nevertheless can be studied by considering them to have random effects. For example, the genetic makeup of individual members of a species cannot at present be (fully) experimentally manipulated, yet it is of great interest to the geneticist to assess the genetic contribution to individual variation on outcomes such as health, behavioral characteristics, and the like. As another example, a manufacturer might want to estimate the components of variation in the characteristics of a product for a random sample of machines operated by a random sample of operators. The statistical analysis of random effects is accomplished by using the *random effect model* if all of the independent variables are assumed to have random effects, or by using the *mixed model* if some of the independent variables are assumed to have random effects and other independent variables are assumed to have fixed effects.

**Properties of random effects. **To illustrate some of the properties of *random effects*, suppose you collected data on the amount of insect damage to different varieties of wheat. It is impractical to study insect damage for every possible variety of wheat, so to conduct the experiment, you randomly select four varieties of wheat to study. Plant damage is rated for up to a maximum of four plots per variety. Ratings are on a 0 (no damage) to 10 (great damage) scale. The following data for this example are presented in Milliken and Johnson (1992, p. 237).

DATA: wheat.sta 3v | ||
---|---|---|

VARIETY | PLOT | DAMAGE |

A A A B B B B C C C C D D |
1 2 3 4 5 6 7 8 9 10 11 12 13 |
3.90 4.05 4.25 3.60 4.20 4.05 3.85 4.15 4.60 4.15 4.40 3.35 3.80 |

To determine the components of variation in resistance to insect damage for *Variety* and *Plot*, an ANOVA can first be performed. Perhaps surprisingly, in the ANOVA, *Variety* can be treated as a fixed or as a random factor without influencing the results (provided that *Type I Sums of squares* are used and that *Variety* is always entered first in the model). The spreadsheet below shows the ANOVA results of a mixed model analysis treating *Variety* as a *fixed effect* and ignoring *Plot*, i.e., treating the plot-to-plot variation as a measure of random error.

ANOVA Results: DAMAGE (wheat.sta) | |||||||
---|---|---|---|---|---|---|---|

Effect |
Effect (F/R) |
df Effect |
MS Effect |
df Error |
MS Error |
F |
p |

{1}VARIETY | Fixed | 3 | .270053 | 9 | .056435 | 4.785196 | .029275 |

Another way to perform the same mixed model analysis is to treat *Variety* as a *fixed effect* and *Plot* as a *random effect*. The spreadsheet below shows the ANOVA results for this mixed model analysis.

ANOVA Results for Synthesized Errors: DAMAGE (wheat.sta) | |||||||
---|---|---|---|---|---|---|---|

df error computed using Satterthwaite method | |||||||

Effect |
Effect (F/R) |
df Effect |
MS Effect |
df Error |
MS Error |
F |
p |

{1}VARIETY {2}PLOT |
Fixed Random |
3 9 |
.270053 .056435 |
9 —– |
.056435 —– |
4.785196 —– |
.029275 —– |

The spreadsheet below shows the ANOVA results for a *random effect* model treating *Plot* as a *random effect* nested within *Variety*, which is also treated as a *random effect*.

ANOVA Results for Synthesized Errors: DAMAGE (wheat.sta) | |||||||
---|---|---|---|---|---|---|---|

df error computed using Satterthwaite method | |||||||

Effect |
Effect (F/R) |
df Effect |
MS Effect |
df Error |
MS Error |
F |
p |

{1}VARIETY {2}PLOT |
Random Random |
3 9 |
.270053 .056435 |
9 —– |
.056435 —– |
4.785196 —– |
.029275 —– |

As can be seen, the tests of significance for the *Variety* effect are identical in all three analyses (and in fact, there are even more ways to produce the same result). When components of variance are estimated, however, the difference between the mixed model (treating *Variety* as fixed) and the random model (treating *Variety* as random) becomes apparent. The spreadsheet below shows the *variance component* estimates for the mixed model treating *Variety* as a *fixed effect*.

Components of Variance (wheat.sta) | |
---|---|

Mean Squares Type: 1 | |

Source | DAMAGE |

{2}PLOT Error |
.056435 0.000000 |

The spreadsheet below shows the *variance component* estimates for the *random effects* model treating *Variety* and *Plot* as *random effects*.

Components of Variance (wheat.sta) | |
---|---|

Mean Squares Type: 1 | |

Source | DAMAGE |

{1}VARIETY {2}PLOT Error |
.067186 .056435 0.000000 |

As can be seen, the difference in the two sets of estimates is that a *variance component* is estimated for *Variety* only when it is considered to be a *random effect*. This reflects the basic distinction between *fixed* and *random effects*. The variation in the levels of random factors is assumed to be representative of the variation of the whole population of possible levels. Thus, variation in the levels of a random factor can be used to estimate the population variation. Even more importantly, covariation between the levels of a random factor and responses on a dependent variable can be used to estimate the population component of variance in the dependent variable attributable to the random factor. The variation in the levels of fixed factors is instead considered to be arbitrarily determined by the experimenter (i.e., the experimenter can make the levels of a fixed factor vary as little or as much as desired). Thus, the variation of a fixed factor cannot be used to estimate its population variance, nor can the population covariance with the dependent variable be meaningfully estimated. With this basic distinction between *fixed effects* and *random effects* in mind, we now can look more closely at the properties of *variance components*.

Estimation of Variance Components (Technical Overview)

The basic goal of *variance component* estimation is to estimate the population covariation between random factors and the dependent variable. Depending on the method used to estimate variance components, the population variances of the random factors can also be estimated, and significance tests can be performed to test whether the population covariation between the random factors and the dependent variable are nonzero.

**Estimating the variation of random factors. **The ANOVA method provides an integrative approach to estimating variance components, because ANOVA techniques can be used to estimate the variance of random factors, to estimate the components of variance in the dependent variable attributable to the random factors, and to test whether the variance components differ significantly from zero. The ANOVA method for estimating the variance of the random factors begins by constructing the *Sums of squares and cross products* (*SSCP*) matrix for the independent variables. The sums of squares and cross products for the random effects are then residualized on the *fixed effects*, leaving the *random effects* independent of the *fixed effects*, as required in the mixed model (see, for example, Searle, Casella, & McCulloch, 1992). The residualized *Sums of squares and cross products* for each random factor are then divided by their degrees of freedom to produce the coefficients in the *Expected mean squares* matrix. Nonzero off-diagonal coefficients for the random effects in this matrix indicate confounding, which must be taken into account when estimating the population variance for each factor. For the *wheat.sta* data, treating both *Variety* and *Plot* as random effects, the coefficients in the Expected mean squares matrix show that the two factors are at least somewhat confounded. The *Expected mean squares* spreadsheet is shown below.

Expected Mean Squares (wheat.sta) | ||||
---|---|---|---|---|

Mean Squares Type: 1 | ||||

Source |
Effect (F/R) |
VARIETY |
PLOT |
Error |

{1}VARIETY {2}PLOT Error |
Random Random |
3.179487 | 1.000000 1.000000 |
1.000000 1.000000 1.000000 |

The coefficients in the *Expected mean squares* matrix are used to estimate the population variation of the random effects by equating their variances to their expected mean squares. For example, the estimated population variance for *Variety* using *Type I Sums of squares* would be 3.179487 times the *Mean square* for *Variety* plus 1 times the *Mean square* for *Plot* plus 1 times the *Mean square* for *Error*.

The ANOVA method provides an integrative approach to estimating *variance components*, but it is not without problems (i.e., ANOVA estimates of *variance components* are generally *biased*, and can be *negative*, even though variances, by definition, must be either zero or positive). An alternative to ANOVA estimation is provided by *maximum likelihood* estimation. Maximum likelihood methods for estimating variance components are based on quadratic forms and typically, but not always, require iteration to find a solution. Perhaps the simplest form of maximum likelihood estimation is *MIVQUE(0)* estimation. MIVQUE(0) produces Minimum Variance Quadratic Unbiased Estimators (i.e., MIVQUE). In MIVQUE(0) estimation, there is no weighting of the *random effects* (thus the 0 [zero] after MIVQUE), so an iterative solution for estimating variance components is not required. MIVQUE(0) estimation begins by constructing the *Quadratic sums of squares (SSQ)* matrix. The elements for the random effects in the *SSQ* matrix can most simply be described as the sums of squares of the sums of squares and cross products for each random effect in the model (after residualization on the *fixed effects*). The elements of this matrix provide coefficients, similar to the elements of the *Expected Mean Squares* matrix, which are used to estimate the covariances among the random factors and the dependent variable. The *SSQ* matrix for the *wheat.sta* data is shown below. Note that the nonzero off-diagonal element for *Variety* and *Plot* again shows that the two random factors are at least somewhat confounded.

MIVQUE(0) Variance Component Estimation (wheat.sta) | ||||
---|---|---|---|---|

SSQ Matrix | ||||

Source | VARIETY | PLOT | Error | DAMAGE |

{1}VARIETY {2}PLOT Error |
31.90533 9.53846 9.53846 |
9.53846 12.00000 12.00000 |
9.53846 12.00000 12.00000 |
2.418964 1.318077 1.318077 |

*Restricted Maximum Likelihood (REML)* and *Maximum Likelihood (ML) variance component* estimation methods are closely related to MIVQUE(0). In fact, in some programs, *REML* and *ML* use *MIVQUE(0)* estimates as start values for an iterative solution for the *variance components*, so the elements of the *SSQ* matrix serve as initial estimates of the covariances among the random factors and the dependent variable for both *REML* and *ML*.

**Estimating components of variation. **For ANOVA methods for estimating *variance components*, a solution is found for the system of equations relating the estimated population variances and covariances among the random factors to the estimated population covariances between the random factors and the dependent variable. The solution then defines the *variance components*. The spreadsheet below shows the *Type I Sums of squares* estimates of the *variance components* for the *wheat.sta*data.

Components of Variance (wheat.sta) | |
---|---|

Mean Squares Type: 1 | |

Source | DAMAGE |

{1}VARIETY {2}PLOT Error |
0.067186 0.056435 0.000000 |

MIVQUE(0) variance components are estimated by inverting the partition of the *SSQ* matrix that does not include the dependent variable (or finding the generalized inverse, for singular matrices), and postmultiplying the inverse by the dependent variable column vector. This amounts to solving the system of equations that relates the dependent variable to the random independent variables, taking into account the covariation among the independent variables. The MIVQUE(0) estimates for the *wheat.sta*data are listed in the spreadsheet shown below.

MIVQUE(0) Variance Component Estimation (wheat.sta) | |
---|---|

Variance Components | |

Source | DAMAGE |

{1}VARIETY {2}PLOT Error |
0.056376 0.065028 0.000000 |

REML and ML variance components are estimated by iteratively optimizing the parameter estimates for the effects in the model. REML differs from ML in that the likelihood of the data is maximized only for the *random effects*, thus REML is a restricted solution. In both REML and ML estimation, an iterative solution is found for the weights for the random effects in the model that maximizes the likelihood of the data. Some programs use MIVQUE(0) estimates as the start values for both REML and ML estimation, so the relation among these three techniques is close indeed. The statistical theory underlying *maximum likelihood* variance component estimation techniques is an advanced topic (Searle, Casella, & McCulloch, 1992, is recommended as an authoritative and comprehensive source). Implementation of maximum likelihood estimation algorithms, furthermore, is difficult (see, for example, Hemmerle & Hartley, 1973, and Jennrich & Sampson, 1976, for descriptions of these algorithms), and faulty implementation can lead to variance component estimates that lie outside the parameter space, converge prematurely to nonoptimal solutions, or give nonsensical results. Milliken and Johnson (1992) noted all of these problems with the commercial software packages they used to estimate variance components.

The basic idea behind both REML and ML estimation is to find the set of weights for the *random effects* in the model that minimize the negative of the natural logarithm times the likelihood of the data (the likelihood of the data can vary from zero to one, so minimizing the negative of the natural logarithm times the likelihood of the data amounts to maximizing the probability, or the likelihood, of the data). The logarithm of the REML likelihood and the REML variance component estimates for the *wheat.sta* data are listed in the last row of the *Iteration history* spreadsheet shown below.

Iteration History (wheat.sta) | ||||
---|---|---|---|---|

Variable: DAMAGE | ||||

Iter. | Log LL | Error | VARIETY | |

1 2 3 4 5 6 7 |
-2.30618 -2.25253 -2.25130 -2.25088 -2.25081 -2.25081 -2.25081 |
.057430 .057795 .056977 .057005 .057006 .057003 .057003 |
.068746 .073744 .072244 .073138 .073160 .073155 .073155 |

The logarithm of the ML likelihood and the ML estimates for the variance components for the *wheat.sta* data are listed in the last row of the *Iteration history*spreadsheet shown below.

Iteration History (wheat.sta) | ||||
---|---|---|---|---|

Variable: DAMAGE | ||||

Iter. | Log LL | Error | VARIETY | |

1 2 3 4 5 6 |
-2.53585 -2.48382 -2.48381 -2.48381 -2.48381 -2.48381 |
.057454 .057427 .057492 .057491 .057492 .057492 |
.048799 .048541 .048639 .048552 .048552 .048552 |

As can be seen, the estimates of the variance components for the different methods are quite similar. In general, components of variance using different estimation methods tend to agree fairly well (see, for example, Swallow & Monahan, 1984).

**Testing the significance of variance components. **When maximum likelihood estimation techniques are used, standard linear model significance testing techniques may not be applicable. ANOVA techniques such as decomposing sums of squares and testing the significance of effects by taking ratios of mean squares are appropriate for linear methods of estimation, but generally are not appropriate for quadratic methods of estimation. When ANOVA methods are used for estimation, standard significance testing techniques can be employed, with the exception that any confounding among random effects must be taken into account.

To test the significance of effects in mixed or random models, error terms must be constructed that contain all the same sources of random variation except for the variation of the respective effect of interest. This is done using Satterthwaite’s method of denominator synthesis (Satterthwaite, 1946), which finds the linear combinations of sources of random variation that serve as appropriate error terms for testing the significance of the respective effect of interest. The spreadsheet below shows the coefficients used to construct these linear combinations for testing the *Variety* and *Plot* effects.

Denominator Synthesis: Coefficients (MS Type: 1) (wheat.sta) | ||||
---|---|---|---|---|

The synthesized MS Errors are linear combinations of the resp. MS effects |
||||

Effect | (F/R) | VARIETY | PLOT | Error |

{1}VARIETY {2}PLOT |
Random Random |
1.000000 |
1.000000 |

The coefficients show that the *Mean square* for *Variety* should be tested against the *Mean square* for *Plot*, and that the *Mean square* for *Plot* should be tested against the *Mean square* for *Error*. Referring back to the *Expected mean squares* spreadsheet, it is clear that the denominator synthesis has identified appropriate error terms for testing the *Variety* and *Plot*effects. Although this is a simple example, in more complex analyses with various degrees of confounding among the random effects, the denominator synthesis can identify appropriate error terms for testing the random effects that would not be readily apparent.

To perform the tests of significance of the random effects, ratios of appropriate Mean squares are formed to compute *F* statistics and *p-*values for each effect. Note that in complex analyses, the degrees of freedom for random effects can be fractional rather than integer values, indicating that fractions of sources of variation were used in synthesizing appropriate error terms for testing the random effects. The spreadsheet displaying the results of the ANOVA for the *Variety* and *Plot* *random effects* is shown below. Note that, for this simple design, the results are identical to the results presented earlier in the spreadsheet for the ANOVA treating *Plot* as a *random effect* nested within *Variety*.

ANOVA Results for Synthesized Errors: DAMAGE (wheat.sta) | |||||||
---|---|---|---|---|---|---|---|

df error computed using Satterthwaite method | |||||||

Effect |
Effect (F/R) |
df Effect |
MS Effect |
df Error |
MS Error |
F |
p |

{1}VARIETY {2}PLOT |
Fixed Random |
3 9 |
.270053 .056435 |
9 —– |
.056435 —– |
4.785196 —– |
.029275 —– |

As shown in the spreadsheet, the *Variety* effect is found to be significant at *p* < .05, but as would be expected, the *Plot* effect cannot be tested for significance because plots served as the basic unit of analysis. If data on samples of plants taken within plots were available, a test of the significance of the *Plot*effect could be constructed.

Appropriate tests of significance for MIVQUE(0) variance component estimates generally cannot be constructed except in special cases (see Searle, Casella, & McCulloch, 1992). Asymptotic (large sample) tests of significance of REML and ML variance component estimates, however, can be constructed for the parameter estimates from the final iteration of the solution. The spreadsheet below shows the asymptotic (large sample) tests of significance for the REML estimates for the *wheat.sta* data.

Restricted Maximum Likelihood Estimates (wheat.sta) | ||||
---|---|---|---|---|

Variable: DAMAGE -2*Log(Likelihood)=4.50162399 |
||||

Effect |
Variance Comp. |
Asympt. Std.Err. |
Asympt. z |
Asympt. p |

{1}VARIETY Error |
.073155 .057003 |
.078019 .027132 |
.937656 2.100914 |
.348421 .035648 |

The spreadsheet below shows the asymptotic (large sample) tests of significance for the ML estimates for the *wheat.sta*data.

Maximum Likelihood Estimates (wheat.sta) | ||||
---|---|---|---|---|

Variable: DAMAGE -2*Log(Likelihood)=4.96761616 |
||||

Effect |
Variance Comp. |
Asympt. Std.Err. |
Asympt. z |
Asympt. p |

{1}VARIETY Error |
.048552 .057492 |
.050747 .027598 |
.956748 2.083213 |
.338694 .037232 |

It should be emphasized that the asymptotic tests of significance for REML and ML variance component estimates are based on large sample sizes, which certainly is not the case for the *wheat.sta*data. For this data set, the tests of significance from both analyses agree in suggesting that the Variety variance component does not differ significantly from zero.

For basic information on ANOVA in linear models, see also *Elementary Concepts*.

**Estimating the population intraclass correlation.** Note that if the *variance component* estimates for the random effects in the model are divided by the sum of all components (including the error component), the resulting percentages are population intraclass correlation coefficientsfor the respective effects.

## How To Identify Patterns in Time Series Data: Time Series Analysis

# How To Identify Patterns in Time Series Data: Time Series Analysis

In the following topics, we will first review techniques used to identify patterns in time series data (such as smoothing and curve fitting techniques and autocorrelations), then we will introduce a general class of models that can be used to represent time series data and generate predictions (autoregressive and moving average models). Finally, we will review some simple but commonly used modeling and forecasting techniques based on linear regression. For more information see the topics below.

## General Introduction

In the following topics, we will review techniques that are useful for analyzing time series data, that is, sequences of measurements that follow non-random orders. Unlike the analyses of random samples of observations that are discussed in the context of most other statistics, the analysis of time series is based on the assumption that successive values in the data file represent consecutive measurements taken at equally spaced time intervals.

Detailed discussions of the methods described in this section can be found in Anderson (1976), Box and Jenkins (1976), Kendall (1984), Kendall and Ord (1990), Montgomery, Johnson, and Gardiner (1990), Pankratz (1983), Shumway (1988), Vandaele (1983), Walker (1991), and Wei (1989).

## Two Main Goals

There are two main goals of time series analysis: (a) identifying the nature of the phenomenon represented by the sequence of observations, and (b) forecasting (predicting future values of the time series variable). Both of these goals require that the pattern of observed time series data is identified and more or less formally described. Once the pattern is established, we can interpret and integrate it with other data (i.e., use it in our theory of the investigated phenomenon, e.g., seasonal commodity prices). Regardless of the depth of our understanding and the validity of our interpretation (theory) of the phenomenon, we can extrapolate the identified pattern to predict future events.

## Identifying Patterns in Time Series Data

- Systematic pattern and random noise
- Two general aspects of time series patterns
- Trend Analysis
- Analysis of Seasonality

For more information on simple autocorrelations (introduced in this section) and other auto correlations, see Anderson (1976), Box and Jenkins (1976), Kendall (1984), Pankratz (1983), and Vandaele (1983). See also:

- ARIMA (Box & Jenkins) and Autocorrelations
- Interrupted Time Series
- Exponential Smoothing
- Seasonal Decomposition (Census I)
- X-11 Census method II seasonal adjustment
- X-11 Census method II result tables
- Distributed Lags Analysis
- Single Spectrum (Fourier) Analysis
- Cross-spectrum Analysis
- Basic Notations and Principles
- Fast Fourier Transformations

### Systematic Pattern and Random Noise

As in most other analyses, in time series analysis it is assumed that the data consist of a systematic pattern (usually a set of identifiable components) and random noise (error) which usually makes the pattern difficult to identify. Most time series analysis techniques involve some form of filtering out noise in order to make the pattern more salient.

### Two General Aspects of Time Series Patterns

Most time series patterns can be described in terms of two basic classes of components: trend and seasonality. The former represents a general systematic linear or (most often) nonlinear component that changes over time and does not repeat or at least does not repeat within the time range captured by our data (e.g., a plateau followed by a period of exponential growth). The latter may have a formally similar nature (e.g., a plateau followed by a period of exponential growth), however, it repeats itself in systematic intervals over time. Those two general classes of time series components may coexist in real-life data. For example, sales of a company can rapidly grow over years but they still follow consistent seasonal patterns (e.g., as much as 25% of yearly sales each year are made in December, whereas only 4% in August).

This general pattern is well illustrated in a “classic” *Series G* data set (Box and Jenkins, 1976, p. 531) representing monthly international airline passenger totals (measured in thousands) in twelve consecutive years from 1949 to 1960 (see example data file *G.sta* and graph above). If you plot the successive observations (months) of airline passenger totals, a clear, almost linear trend emerges, indicating that the airline industry enjoyed a steady growth over the years (approximately 4 times more passengers traveled in 1960 than in 1949). At the same time, the monthly figures will follow an almost identical pattern each year (e.g., more people travel during holidays than during any other time of the year). This example data file also illustrates a very common general type of pattern in time series data, where the amplitude of the seasonal changes increases with the overall trend (i.e., the variance is correlated with the mean over the segments of the series). This pattern which is called *multiplicative seasonality* indicates that the relative amplitude of seasonal changes is constant over time, thus it is related to the trend.

### Trend Analysis

There are no proven “automatic” techniques to identify trend components in the time series data; however, as long as the trend is monotonous (consistently increasing or decreasing) that part of data analysis is typically not very difficult. If the time series data contain considerable error, then the first step in the process of trend identification is smoothing.

**Smoothing. **Smoothing always involves some form of local averaging of data such that the nonsystematic components of individual observations cancel each other out. The most common technique is *moving average* smoothing which replaces each element of the series by either the simple or weighted average of *n* surrounding elements, where *n* is the width of the smoothing “window” (see Box & Jenkins, 1976; Velleman & Hoaglin, 1981). Medians can be used instead of means. The main advantage of median as compared to moving average smoothing is that its results are less biased by outliers (within the smoothing window). Thus, if there are outliers in the data (e.g., due to measurement errors), median smoothing typically produces smoother or at least more “reliable” curves than moving average based on the same window width. The main disadvantage of median smoothing is that in the absence of clear outliers it may produce more “jagged” curves than moving average and it does not allow for weighting.

In the relatively less common cases (in time series data), when the measurement error is very large, the *distance weighted least squares smoothing* or *negative exponentially weighted smoothing* techniques can be used. All those methods will filter out the noise and convert the data into a smooth curve that is relatively unbiased by outliers (see the respective sections on each of those methods for more details). Series with relatively few and systematically distributed points can be smoothed with *bicubic splines*.

**Fitting a function. **Many monotonous time series data can be adequately approximated by a linear function; if there is a clear monotonous nonlinear component, the data first need to be transformed to remove the nonlinearity. Usually a logarithmic, exponential, or (less often) polynomial function can be used.

### Analysis of Seasonality

Seasonal dependency (seasonality) is another general component of the time series pattern. The concept was illustrated in the example of the airline passengers data above. It is formally defined as correlational dependency of order *k* between each *i*‘th element of the series and the (*i-k*)’th element (Kendall, 1976) and measured by autocorrelation (i.e., a correlation between the two terms); *k* is usually called the *lag*. If the measurement error is not too large, seasonality can be visually identified in the series as a pattern that repeats every *k* elements.

**Autocorrelation correlogram. **Seasonal patterns of time series can be examined via correlograms. The correlogram (autocorrelogram) displays graphically and numerically the autocorrelation function (*ACF*), that is, serial correlation coefficients (and their standard errors) for consecutive lags in a specified range of lags (e.g., 1 through 30). Ranges of two standard errors for each lag are usually marked in correlograms but typically the size of auto correlation is of more interest than its reliability (see *Elementary Concepts*) because we are usually interested only in very strong (and thus highly significant) autocorrelations.

**Examining correlograms. **While examining correlograms, you should keep in mind that autocorrelations for consecutive lags are formally dependent. Consider the following example. If the first element is closely related to the second, and the second to the third, then the first element must also be somewhat related to the third one, etc. This implies that the pattern of serial dependencies can change considerably after removing the first order auto correlation (i.e., after differencing the series with a lag of 1).

**Partial autocorrelations. **Another useful method to examine serial dependencies is to examine the partial autocorrelation function (*PACF*) – an extension of autocorrelation, where the dependence on the intermediate elements (those *within* the lag) is removed. In other words the partial autocorrelation is similar to autocorrelation, except that when calculating it, the (auto) correlations with all the elements within the lag are partialled out (Box & Jenkins, 1976; see also McDowall, McCleary, Meidinger, & Hay, 1980). If a lag of 1 is specified (i.e., there are no intermediate elements within the lag), then the partial autocorrelation is equivalent to auto correlation. In a sense, the partial autocorrelation provides a “cleaner” picture of serial dependencies for individual lags (not confounded by other serial dependencies).

**Removing serial dependency. **Serial dependency for a particular lag of *k* can be removed by differencing the series, that is converting each *i*‘th element of the series into its difference from the (*i-k*)”th element. There are two major reasons for such transformations.

First, we can identify the hidden nature of seasonal dependencies in the series. Remember that, as mentioned in the previous paragraph, autocorrelations for consecutive lags are interdependent. Therefore, removing some of the autocorrelations will change other auto correlations, that is, it may eliminate them or it may make some other seasonalities more apparent.

The other reason for removing seasonal dependencies is to make the series *stationary* which is necessary for ARIMA and other techniques.

## ARIMA

- General Introduction
- Two Common Processes
- ARIMA Methodology
- Identification Phase
- Parameter Estimation
- Evaluation of the Model

For more information on *Time Series* methods, see also:

- Identifying Patterns in Time Series Data
- Interrupted Time Series
- Exponential Smoothing
- Seasonal Decomposition (Census I)
- X-11 Census method II seasonal adjustment
- X-11 Census method II result tables
- Distributed Lags Analysis
- Single Spectrum (Fourier) Analysis
- Cross-spectrum Analysis
- Basic Notations and Principles
- Fast Fourier Transformations

### General Introduction

The modeling and forecasting procedures discussed in *Identifying Patterns in Time Series Data* involved knowledge about the mathematical model of the process. However, in real-life research and practice, patterns of the data are unclear, individual observations involve considerable error, and we still need not only to uncover the hidden patterns in the data but also generate forecasts. The ARIMA methodology developed by Box and Jenkins (1976) allows us to do just that; it has gained enormous popularity in many areas and research practice confirms its power and flexibility (Hoff, 1983; Pankratz, 1983; Vandaele, 1983). However, because of its power and flexibility, ARIMA is a complex technique; it is not easy to use, it requires a great deal of experience, and although it often produces satisfactory results, those results depend on the researcher’s level of expertise (Bails & Peppers, 1982). The following sections will introduce the basic ideas of this methodology. For those interested in a brief, applications-oriented (non- mathematical), introduction to ARIMA methods, we recommend McDowall, McCleary, Meidinger, and Hay (1980).

### Two Common Processes

**Autoregressive process. **Most time series consist of elements that are serially dependent in the sense that you can estimate a coefficient or a set of coefficients that describe consecutive elements of the series from specific, time-lagged (previous) elements. This can be summarized in the equation: x_{t} = + _{1}*x_{(t-1)} + _{2}*x_{(t-2)} + _{3}*x_{(t-3)} + … +

Where:

is a constant (intercept), and

_{1}, _{2}, _{3} are the autoregressive model parameters.

Put into words, each observation is made up of a random error component (random shock, ) and a linear combination of prior observations.

**Stationarity requirement. **Note that an autoregressive process will only be stable if the parameters are within a certain range; for example, if there is only one autoregressive parameter then is must fall within the interval of -1 < < 1. Otherwise, past effects would accumulate and the values of successive *x _{t}*‘ s would move towards infinity, that is, the series would not be stationary. If there is more than one autoregressive parameter, similar (general) restrictions on the parameter values can be defined (e.g., see Box & Jenkins, 1976; Montgomery, 1990).

**Moving average process.** Independent from the autoregressive process, each element in the series can also be affected by the past error (or random shock) that cannot be accounted for by the autoregressive component, that is:

x_{t} = µ + _{t} – _{1}*_{(t-1)} – _{2}*_{(t-2)} – _{3}*_{(t-3)} – …

Where:

µ is a constant, and

_{1}, _{2}, _{3} are the moving average model parameters.

Put into words, each observation is made up of a random error component (random shock, ) and a linear combination of prior random shocks.

**Invertibility requirement. **Without going into too much detail, there is a “duality” between the moving average process and the autoregressive process (e.g., see Box & Jenkins, 1976; Montgomery, Johnson, & Gardiner, 1990), that is, the moving average equation above can be rewritten (*inverted*) into an autoregressive form (of infinite order). However, analogous to the stationarity condition described above, this can only be done if the moving average parameters follow certain conditions, that is, if the model is *invertible*. Otherwise, the series will not be stationary.

### ARIMA Methodology

**Autoregressive moving average model. **The general model introduced by Box and Jenkins (1976) includes autoregressive as well as moving average parameters, and explicitly includes differencing in the formulation of the model. Specifically, the three types of parameters in the model are: the autoregressive parameters (*p*), the number of differencing passes (*d*), and moving average parameters (*q*). In the notation introduced by Box and Jenkins, models are summarized as ARIMA (*p, d, q*); so, for example, a model described as (0, 1, 2) means that it contains 0 (zero) autoregressive (*p*) parameters and 2 moving average (*q*) parameters which were computed for the series after it was differenced once.

**Identification. **As mentioned earlier, the input series for ARIMA needs to be stationary, that is, it should have a constant mean, variance, and autocorrelation through time. Therefore, usually the series first needs to be differenced until it is stationary (this also often requires log transforming the data to stabilize the variance). The number of times the series needs to be differenced to achieve stationarity is reflected in the *d* parameter (see the previous paragraph). In order to determine the necessary level of differencing, you should examine the plot of the data and autocorrelogram. Significant changes in level (strong upward or downward changes) usually require first order non seasonal (lag=1) differencing; strong changes of slope usually require second order non seasonal differencing. Seasonal patterns require respective seasonal differencing (see below). If the estimated autocorrelation coefficients decline slowly at longer lags, first order differencing is usually needed. However, you should keep in mind that some time series may require little or no differencing, and that *over differenced* series produce less stable coefficient estimates.

At this stage (which is usually called *Identification* phase, see below) we also need to decide how many autoregressive (*p*) and moving average (*q*) parameters are necessary to yield an effective but still *parsimonious* model of the process (*parsimonious* means that it has the fewest parameters and greatest number of degrees of freedom among all models that fit the data). In practice, the numbers of the *p* or *q* parameters very rarely need to be greater than 2 (see below for more specific recommendations).

**Estimation and Forecasting.** At the next step (*Estimation*), the parameters are estimated (using function minimization procedures, see below; for more information on minimization procedures see also *Nonlinear Estimation*), so that the sum of squared residuals is minimized. The estimates of the parameters are used in the last stage (*Forecasting*) to calculate new values of the series (beyond those included in the input data set) and confidence intervals for those predicted values. The estimation process is performed on transformed (differenced) data; before the forecasts are generated, the series needs to be *integrated* (integration is the inverse of differencing) so that the forecasts are expressed in values compatible with the input data. This automatic integration feature is represented by the letter I in the name of the methodology (ARIMA = Auto-Regressive Integrated Moving Average).

**The constant in ARIMA models. **In addition to the standard autoregressive and moving average parameters, ARIMA models may also include a constant, as described above. The interpretation of a (statistically significant) constant depends on the model that is fit. Specifically, (1) if there are no autoregressive parameters in the model, then the expected value of the constant is , the mean of the series; (2) if there are autoregressive parameters in the series, then the constant represents the intercept. If the series is differenced, then the constant represents the mean or intercept of the differenced series; For example, if the series is differenced once, and there are no autoregressive parameters in the model, then the constant represents the mean of the differenced series, and therefore the *linear trend slope* of the un-differenced series.

### Identification Phase

**Number of parameters to be estimated. **Before the estimation can begin, we need to decide on (identify) the specific number and type of ARIMA parameters to be estimated. The major tools used in the identification phase are plots of the series, correlograms of auto correlation (ACF), and partial autocorrelation (PACF). The decision is not straightforward and in less typical cases requires not only experience but also a good deal of experimentation with alternative models (as well as the technical parameters of ARIMA). However, a majority of empirical time series patterns can be sufficiently approximated using one of the 5 basic models that can be identified based on the shape of the autocorrelogram (ACF) and partial auto correlogram (PACF). The following brief summary is based on practical recommendations of Pankratz (1983); for additional practical advice, see also Hoff (1983), McCleary and Hay (1980), McDowall, McCleary, Meidinger, and Hay (1980), and Vandaele (1983). Also, note that since the number of parameters (to be estimated) of each kind is almost never greater than 2, it is often practical to try alternative models on the same data.

*One autoregressive (p) parameter*: ACF – exponential decay; PACF – spike at lag 1, no correlation for other lags.*Two autoregressive (p) parameters*: ACF – a sine-wave shape pattern or a set of exponential decays; PACF – spikes at lags 1 and 2, no correlation for other lags.*One moving average (q) parameter*: ACF – spike at lag 1, no correlation for other lags; PACF – damps out exponentially.*Two moving average (q) parameters*: ACF – spikes at lags 1 and 2, no correlation for other lags; PACF – a sine-wave shape pattern or a set of exponential decays.*One autoregressive (p) and one moving average (q) parameter*: ACF – exponential decay starting at lag 1; PACF – exponential decay starting at lag 1.

**Seasonal models. **Multiplicative seasonal ARIMA is a generalization and extension of the method introduced in the previous paragraphs to series in which a pattern repeats seasonally over time. In addition to the non-seasonal parameters, seasonal parameters for a specified lag (established in the identification phase) need to be estimated. Analogous to the simple ARIMA parameters, these are: seasonal autoregressive (*ps*), seasonal differencing (*ds*), and seasonal moving average parameters (*qs*). For example, the model (*0,1,2*)(*0,1,1*) describes a model that includes no autoregressive parameters, 2 regular moving average parameters and 1 seasonal moving average parameter, and these parameters were computed for the series after it was differenced once with lag 1, and once seasonally differenced. The seasonal lag used for the seasonal parameters is usually determined during the identification phase and must be explicitly specified.

The general recommendations concerning the selection of parameters to be estimated (based on ACF and PACF) also apply to seasonal models. The main difference is that in seasonal series, ACF and PACF will show sizable coefficients at multiples of the seasonal lag (in addition to their overall patterns reflecting the non seasonal components of the series).

### Parameter Estimation

There are several different methods for estimating the parameters. All of them should produce very similar estimates, but may be more or less efficient for any given model. In general, during the parameter estimation phase a function minimization algorithm is used (the so-called *quasi-Newton* method; refer to the description of the *Nonlinear Estimation* method) to maximize the likelihood (probability) of the observed series, given the parameter values. In practice, this requires the calculation of the (conditional) sums of squares (SS) of the residuals, given the respective parameters. Different methods have been proposed to compute the SS for the residuals: (1) the approximate maximum likelihood method according to McLeod and Sales (1983), (2) the approximate maximum likelihood method with backcasting, and (3) the exact maximum likelihood method according to Melard (1984).

**Comparison of methods. **In general, all methods should yield very similar parameter estimates. Also, all methods are about equally efficient in most real-world time series applications. However, method *1* above, (approximate maximum likelihood, no backcasts) is the fastest, and should be used in particular for very long time series (e.g., with more than 30,000 observations). Melard’s exact maximum likelihood method (number *3* above) may also become inefficient when used to estimate parameters for seasonal models with long seasonal lags (e.g., with yearly lags of 365 days). On the other hand, you should always use the approximate maximum likelihood method first in order to establish initial parameter estimates that are very close to the actual final values; thus, usually only a few iterations with the exact maximum likelihood method (*3*, above) are necessary to finalize the parameter estimates.

**Parameter standard errors. **For all parameter estimates, you will compute so-called *asymptotic standard errors*. These are computed from the matrix of second-order partial derivatives that is approximated via finite differencing (see also the respective discussion in *Nonlinear Estimation*).

**Penalty value. **As mentioned above, the estimation procedure requires that the (conditional) sums of squares of the ARIMA residuals be minimized. If the model is inappropriate, it may happen during the iterative estimation process that the parameter estimates become very large, and, in fact, invalid. In that case, it will assign a very large value (a so-called *penalty value*) to the SS. This usually “entices” the iteration process to move the parameters away from invalid ranges. However, in some cases even this strategy fails, and you may see on the screen (during the *Estimation procedure*) very large values for the SS in consecutive iterations. In that case, carefully evaluate the appropriateness of your model. If your model contains many parameters, and perhaps an intervention component (see below), you may try again with different parameter start values.

### Evaluation of the Model

**Parameter estimates. **You will report approximate *t* values, computed from the parameter standard errors (see above). If not significant, the respective parameter can in most cases be dropped from the model without affecting substantially the overall fit of the model.

**Other quality criteria. **Another straightforward and common measure of the reliability of the model is the accuracy of its forecasts generated based on partial data so that the forecasts can be compared with known (original) observations.

However, a good model should not only provide sufficiently accurate forecasts, it should also be parsimonious and produce statistically independent residuals that contain only noise and no systematic components (e.g., the correlogram of residuals should not reveal any serial dependencies). A good test of the model is (a) to plot the residuals and inspect them for any systematic trends, and (b) to examine the autocorrelogram of residuals (there should be no serial dependency between residuals).

**Analysis of residuals. **The major concern here is that the residuals are systematically distributed across the series (e.g., they could be negative in the first part of the series and approach zero in the second part) or that they contain some serial dependency which may suggest that the ARIMA model is inadequate. The analysis of ARIMA residuals constitutes an important test of the model. The estimation procedure assumes that the residual are not (auto-) correlated and that they are normally distributed.

**Limitations. **The ARIMA method is appropriate only for a time series that is stationary (i.e., its mean, variance, and autocorrelation should be approximately constant through time) and it is recommended that there are at least 50 observations in the input data. It is also assumed that the values of the estimated parameters are constant throughout the series.

## Interrupted Time Series ARIMA

A common research questions in time series analysis is whether an outside event affected subsequent observations. For example, did the implementation of a new economic policy improve economic performance; did a new anti-crime law affect subsequent crime rates; and so on. In general, we would like to evaluate the impact of one or more discrete events on the values in the time series. This type of interrupted time series analysis is described in detail in McDowall, McCleary, Meidinger, & Hay (1980). McDowall, et. al., distinguish between three major types of impacts that are possible: (1) permanent abrupt, (2) permanent gradual, and (3) abrupt temporary. See also:

- Distributed Lags Analysis
- Single Spectrum (Fourier) Analysis
- Cross-spectrum Analysis
- Basic Notations and Principles
- Fast Fourier Transformations

## Exponential Smoothing

- General Introduction
- Simple Exponential Smoothing
- Choosing the Best Value for Parameter a (alpha)
- Indices of Lack of Fit (Error)
- Seasonal and Non-seasonal Models With or Without Trend

See also:

- Identifying Patterns in Time Series Data
- ARIMA (Box & Jenkins) and Autocorrelations
- Interrupted Time Series
- Seasonal Decomposition (Census I)
- X-11 Census method II seasonal adjustment
- X-11 Census method II result tables
- Distributed Lags Analysis
- Single Spectrum (Fourier) Analysis
- Cross-spectrum Analysis
- Basic Notations and Principles
- Fast Fourier Transformations

### General Introduction

Exponential smoothing has become very popular as a forecasting method for a wide variety of time series data. Historically, the method was independently developed by Brown and Holt. Brown worked for the US Navy during World War II, where his assignment was to design a tracking system for fire-control information to compute the location of submarines. Later, he applied this technique to the forecasting of demand for spare parts (an inventory control problem). He described those ideas in his 1959 book on inventory control. Holt’s research was sponsored by the Office of Naval Research; independently, he developed exponential smoothing models for constant processes, processes with linear trends, and for seasonal data.

Gardner (1985) proposed a “unified” classification of exponential smoothing methods. Excellent introductions can also be found in Makridakis, Wheelwright, and McGee (1983), Makridakis and Wheelwright (1989), Montgomery, Johnson, & Gardiner (1990).

### Simple Exponential Smoothing

A simple and pragmatic model for a time series would be to consider each observation as consisting of a constant (*b*) and an error component (epsilon), that is: X_{t} = b + _{t}. The constant *b* is relatively stable in each segment of the series, but may change slowly over time. If appropriate, then one way to isolate the true value of *b*, and thus the systematic or predictable part of the series, is to compute a kind of moving average, where the current and immediately preceding (“younger”) observations are assigned greater weight than the respective older observations. Simple exponential smoothing accomplishes exactly such weighting, where exponentially smaller weights are assigned to older observations. The specific formula for simple exponential smoothing is:

S_{t} = *X_{t} + (1-)*S_{t-1}

When applied recursively to each successive observation in the series, each new smoothed value (forecast) is computed as the weighted average of the current observation and the previous smoothed observation; the previous smoothed observation was computed in turn from the previous observed value and the smoothed value before the previous observation, and so on. Thus, in effect, each smoothed value is the weighted average of the previous observations, where the weights decrease exponentially depending on the value of parameter (alpha). If is equal to 1 (one) then the previous observations are ignored entirely; if is equal to 0 (zero), then the current observation is ignored entirely, and the smoothed value consists entirely of the previous smoothed value (which in turn is computed from the smoothed observation before it, and so on; thus all smoothed values will be equal to the initial smoothed value *S _{0}*). Values of in-between will produce intermediate results.

Even though significant work has been done to study the theoretical properties of (simple and complex) exponential smoothing (e.g., see Gardner, 1985; Muth, 1960; see also McKenzie, 1984, 1985), the method has gained popularity mostly because of its usefulness as a forecasting tool. For example, empirical research by Makridakis *et al*. (1982, Makridakis, 1983), has shown simple exponential smoothing to be the best choice for one-period-ahead forecasting, from among 24 other time series methods and using a variety of accuracy measures (see also Gross and Craig, 1974, for additional empirical evidence). Thus, regardless of the theoretical model for the process underlying the observed time series, simple exponential smoothing will often produce quite accurate forecasts.

### Choosing the Best Value for Parameter (alpha)

Gardner (1985) discusses various theoretical and empirical arguments for selecting an appropriate smoothing parameter. Obviously, looking at the formula presented above, should fall into the interval between 0 (zero) and 1 (although, see Brenner *et al.*, 1968, for an ARIMA perspective, implying 0<<2). Gardner (1985) reports that among practitioners, an smaller than .30 is usually recommended. However, in the study by Makridakis *et al*. (1982), values above .30 frequently yielded the best forecasts. After reviewing the literature on this topic, Gardner (1985) concludes that it is best to estimate an optimum from the data (see below), rather than to “guess” and set an artificially low value.

**Estimating the best value from the data. **In practice, the smoothing parameter is often chosen by a *grid search* of the parameter space; that is, different solutions for are tried starting, for example, with = 0.1 to = 0.9, with increments of 0.1. Then is chosen so as to produce the smallest sums of squares (or mean squares) for the residuals (i.e., observed values minus one-step-ahead forecasts; this mean squared error is also referred to as *ex post* mean squared error, *ex post* MSE for short).

### Indices of Lack of Fit (Error)

The most straightforward way of evaluating the accuracy of the forecasts based on a particular value is to simply plot the observed values and the one-step-ahead forecasts. This plot can also include the residuals (scaled against the right *Y*-axis), so that regions of better or worst fit can also easily be identified.

This visual check of the accuracy of forecasts is often the most powerful method for determining whether or not the current exponential smoothing model fits the data. In addition, besides the *ex post* MSE criterion (see previous paragraph), there are other statistical measures of error that can be used to determine the optimum parameter (see Makridakis, Wheelwright, and McGee, 1983):

**Mean error: **The mean error (ME) value is simply computed as the average error value (average of observed minus one-step-ahead forecast). Obviously, a drawback of this measure is that positive and negative error values can cancel each other out, so this measure is not a very good indicator of overall fit.

**Mean absolute error: **The mean absolute error (MAE) value is computed as the average *absolute* error value. If this value is 0 (zero), the fit (forecast) is perfect. As compared to the mean *squared* error value, this measure of fit will “de-emphasize” outliers, that is, unique or rare large error values will affect the MAE less than the MSE value.

**Sum of squared error (SSE), Mean squared error. **These values are computed as the sum (or average) of the squared error values. This is the most commonly used lack-of-fit indicator in statistical fitting procedures.

**Percentage error (PE). **All the above measures rely on the actual error value. It may seem reasonable to rather express the lack of fit in terms of the *relative* deviation of the one-step-ahead forecasts from the observed values, that is, relative to the magnitude of the observed values. For example, when trying to predict monthly sales that may fluctuate widely (e.g., seasonally) from month to month, we may be satisfied if our prediction “hits the target” with about ±10% accuracy. In other words, the absolute errors may be not so much of interest as are the relative errors in the forecasts. To assess the relative error, various indices have been proposed (see Makridakis, Wheelwright, and McGee, 1983). The first one, the percentage error value, is computed as:

PE_{t} = 100*(X_{t} – F_{t} )/X_{t}

where *X _{t}* is the observed value at time

*t*, and

*F*is the forecasts (smoothed values).

_{t}**Mean percentage error (MPE). **This value is computed as the average of the PE values.

**Mean absolute percentage error (MAPE). **As is the case with the mean error value (ME, see above), a mean percentage error near 0 (zero) can be produced by large positive and negative percentage errors that cancel each other out. Thus, a better measure of relative overall fit is the mean *absolute* percentage error. Also, this measure is usually more meaningful than the mean squared error. For example, knowing that the average forecast is “off” by ±5% is a useful result in and of itself, whereas a mean squared error of 30.8 is not immediately interpretable.

**Automatic search for best parameter. **A quasi-Newton function minimization procedure (the same as in ARIMA is used to minimize either the mean squared error, mean absolute error, or mean absolute percentage error. In most cases, this procedure is more efficient than the grid search (particularly when more than one parameter must be determined), and the optimum parameter can quickly be identified.

**The first smoothed value S_{0}**. A final issue that we have neglected up to this point is the problem of the initial value, or how to start the smoothing process. If you look back at the formula above, it is evident that you need an

*S*value in order to compute the smoothed value (forecast) for the first observation in the series. Depending on the choice of the parameter (i.e., when is close to zero), the initial value for the smoothing process can affect the quality of the forecasts for many observations. As with most other aspects of exponential smoothing it is recommended to choose the initial value that produces the best forecasts. On the other hand, in practice, when there are many leading observations prior to a crucial actual forecast, the initial value will not affect that forecast by much, since its effect will have long “faded” from the smoothed series (due to the exponentially decreasing weights, the older an observation the less it will influence the forecast).

_{0}### Seasonal and Non-Seasonal Models With or Without Trend

The discussion above in the context of simple exponential smoothing introduced the basic procedure for identifying a smoothing parameter, and for evaluating the goodness-of-fit of a model. In addition to simple exponential smoothing, more complex models have been developed to accommodate time series with seasonal and trend components. The general idea here is that forecasts are not only computed from consecutive previous observations (as in simple exponential smoothing), but an independent (smoothed) trend and seasonal component can be added. Gardner (1985) discusses the different models in terms of seasonality (none, additive, or multiplicative) and trend (none, linear, exponential, or damped).

**Additive and multiplicative seasonality. **Many time series data follow recurring seasonal patterns. For example, annual sales of toys will probably peak in the months of November and December, and perhaps during the summer (with a much smaller peak) when children are on their summer break. This pattern will likely repeat every year, however, the relative amount of increase in sales during December may slowly change from year to year. Thus, it may be useful to smooth the seasonal component independently with an extra parameter, usually denoted as (*delta*).

Seasonal components can be additive in nature or multiplicative. For example, during the month of December the sales for a particular toy may increase by 1 million dollars every year. Thus, we could *add* to our forecasts for every December the amount of 1 million dollars (over the respective annual average) to account for this seasonal fluctuation. In this case, the seasonality is *additive*.

Alternatively, during the month of December the sales for a particular toy may increase by 40%, that is, increase by a *factor* of 1.4. Thus, when the sales for the toy are generally weak, than the absolute (dollar) increase in sales during December will be relatively weak (but the percentage will be constant); if the sales of the toy are strong, than the absolute (dollar) increase in sales will be proportionately greater. Again, in this case the sales increase by a certain *factor*, and the seasonal component is thus *multiplicative* in nature (i.e., the multiplicative seasonal component in this case would be 1.4).

In plots of the series, the distinguishing characteristic between these two types of seasonal components is that in the additive case, the series shows steady seasonal fluctuations, regardless of the overall level of the series; in the multiplicative case, the size of the seasonal fluctuations vary, depending on the overall level of the series.

**The seasonal smoothing parameter. **In general the one-step-ahead forecasts are computed as (for no trend models, for linear and exponential trend models a trend component is added to the model; see below):

Additive model:

Forecast_{t} = S_{t} + I_{t-p}

Multiplicative model:

Forecast_{t} = S_{t}*I_{t-p}

In this formula, *S _{t}* stands for the (simple) exponentially smoothed value of the series at time

*t*, and

*I*stands for the smoothed seasonal factor at time

_{t-p}*t*minus

*p*(the length of the season). Thus, compared to simple exponential smoothing, the forecast is “enhanced” by adding or multiplying the simple smoothed value by the predicted seasonal component. This seasonal component is derived analogous to the

*S*value from simple exponential smoothing as:

_{t}Additive model:

I_{t} = I_{t-p} + *(1-)*e_{t}

Multiplicative model:

I_{t} = I_{t-p} + *(1-)*e_{t}/S_{t}

Put into words, the predicted seasonal component at time *t* is computed as the respective seasonal component in the last seasonal cycle plus a portion of the error (*e _{t}*; the observed minus the forecast value at time

*t*). Considering the formulas above, it is clear that parameter can assume values between 0 and 1. If it is zero, then the seasonal component for a particular point in time is predicted to be identical to the predicted seasonal component for the respective time during the previous seasonal cycle, which in turn is predicted to be identical to that from the previous cycle, and so on. Thus, if is zero, a constant unchanging seasonal component is used to generate the one-step-ahead forecasts. If the parameter is equal to 1, then the seasonal component is modified “maximally” at every step by the respective forecast error (times (1-), which we will ignore for the purpose of this brief introduction). In most cases, when seasonality is present in the time series, the optimum parameter will fall somewhere between 0 (zero) and 1(one).

**Linear, exponential, and damped trend. **To remain with the toy example above, the sales for a toy can show a linear upward trend (e.g., each year, sales increase by 1 million dollars), exponential growth (e.g., each year, sales increase by a factor of 1.3), or a damped trend (during the first year sales increase by 1 million dollars; during the second year the increase is only 80% over the previous year, i.e., $800,000; during the next year it is again 80% less than the previous year, i.e., $800,000 * .8 = $640,000; etc.). Each type of trend leaves a clear “signature” that can usually be identified in the series; shown below in the brief discussion of the different models are icons that illustrate the general patterns. In general, the trend factor may change slowly over time, and, again, it may make sense to smooth the trend component with a separate parameter (denoted [*gamma*] for linear and exponential trend models, and [*phi*] for damped trend models).

**The trend smoothing parameters (linear and exponential trend) and (damped trend). **Analogous to the seasonal component, when a trend component is included in the exponential smoothing process, an independent trend component is computed for each time, and modified as a function of the forecast error and the respective parameter. If the parameter is 0 (zero), than the trend component is constant across all values of the time series (and for all forecasts). If the parameter is 1, then the trend component is modified “maximally” from observation to observation by the respective forecast error. Parameter values that fall in-between represent mixtures of those two extremes. Parameter is a trend modification parameter, and affects how strongly changes in the trend will affect estimates of the trend for subsequent forecasts, that is, how quickly the trend will be “damped” or increased.

## Classical Seasonal Decomposition (Census Method 1)

See also:

- Identifying Patterns in Time Series Data
- ARIMA (Box & Jenkins) and Autocorrelations
- Interrupted Time Series
- Exponential Smoothing
- X-11 Census method II seasonal adjustment
- X-11 Census method II result tables
- Distributed Lags Analysis
- Single Spectrum (Fourier) Analysis
- Cross-spectrum Analysis
- Basic Notations and Principles
- Fast Fourier Transformations

### General Introduction

Suppose you recorded the monthly passenger load on international flights for a period of 12 years ( see Box & Jenkins, 1976). If you plot those data, it is apparent that (1) there appears to be a linear upwards trend in the passenger loads over the years, and (2) there is a recurring pattern or *seasonality* within each year (i.e., most travel occurs during the summer months, and a minor peak occurs during the December holidays). The purpose of the seasonal decomposition method is to isolate those components, that is, to de-compose the series into the trend effect, seasonal effects, and remaining variability. The “classic” technique designed to accomplish this decomposition is known as the *Census I* method. This technique is described and discussed in detail in Makridakis, Wheelwright, and McGee (1983), and Makridakis and Wheelwright (1989).

**General model. **The general idea of seasonal decomposition is straightforward. In general, a time series like the one described above can be thought of as consisting of four different components: (1) A seasonal component (denoted as *S _{t}*, where

*t*stands for the particular point in time) (2) a trend component (

*T*), (3) a cyclical component (

_{t}*C*), and (4) a random, error, or irregular component (

_{t}*I*). The difference between a cyclical and a seasonal component is that the latter occurs at regular (seasonal) intervals, while cyclical factors have usually a longer duration that varies from cycle to cycle. In the Census I method, the trend and cyclical components are customarily combined into a

_{t}*trend-cycle component*(

*TC*). The specific functional relationship between these components can assume different forms. However, two straightforward possibilities are that they combine in an

_{t}*additive*or a

*multiplicative*fashion:

Additive model:

X_{t} = TC_{t} + S_{t} + I_{t}

Multiplicative model:

X_{t} = T_{t}*C_{t}*S_{t}*I_{t}

Here *X _{t}* stands for the observed value of the time series at time

*t*. Given some

*a priori*knowledge about the cyclical factors affecting the series (e.g., business cycles), the estimates for the different components can be used to compute forecasts for future observations. (However, the

*Exponential smoothing*method, which can also incorporate seasonality and trend components, is the preferred technique for forecasting purposes.)

**Additive and multiplicative seasonality**. Let’s consider the difference between an additive and multiplicative seasonal component in an example: The annual sales of toys will probably peak in the months of November and December, and perhaps during the summer (with a much smaller peak) when children are on their summer break. This seasonal pattern will likely repeat every year. Seasonal components can be additive or multiplicative in nature. For example, during the month of December the sales for a particular toy may increase by 3 million dollars every year. Thus, we could *add* to our forecasts for every December the amount of 3 million to account for this seasonal fluctuation. In this case, the seasonality is *additive*. Alternatively, during the month of December the sales for a particular toy may increase by 40%, that is, increase by a factor of 1.4. Thus, when the sales for the toy are generally weak, then the absolute (dollar) increase in sales during December will be relatively weak (but the percentage will be constant); if the sales of the toy are strong, then the absolute (dollar) increase in sales will be proportionately greater. Again, in this case the sales increase by a certain *factor*, and the seasonal component is thus *multiplicative* in nature (i.e., the multiplicative seasonal component in this case would be 1.4). In plots of series, the distinguishing characteristic between these two types of seasonal components is that in the additive case, the series shows steady seasonal fluctuations, regardless of the overall level of the series; in the multiplicative case, the size of the seasonal fluctuations vary, depending on the overall level of the series.

**Additive and multiplicative trend-cycle. **We can extend the previous example to illustrate the additive and multiplicative trend-cycle components. In terms of our toy example, a “fashion” *trend* may produce a steady increase in sales (e.g., a trend towards more educational toys in general); as with the seasonal component, this trend may be additive (sales increase by 3 million dollars per year) or multiplicative (sales increase by 30%, or by a factor of 1.3, annually) in nature. In addition, cyclical components may impact sales; to reiterate, a cyclical component is different from a seasonal component in that it usually is of longer duration, and that it occurs at irregular intervals. For example, a particular toy may be particularly “hot” during a summer season (e.g., a particular doll which is tied to the release of a major children’s movie, and is promoted with extensive advertising). Again such a cyclical component can effect sales in an additive manner or multiplicative manner.

### Computations

The *Seasonal Decomposition (Census I)* standard formulas are shown in Makridakis, Wheelwright, and McGee (1983), and Makridakis and Wheelwright (1989).

**Moving average. **First a moving average is computed for the series, with the moving average window width equal to the length of one season. If the length of the season is even, then the user can choose to use either equal weights for the moving average or unequal weights can be used, where the first and last observation in the moving average window are averaged.

**Ratios or differences. **In the moving average series, all seasonal (within-season) variability will be eliminated; thus, the differences (in additive models) or ratios (in multiplicative models) of the observed and smoothed series will isolate the seasonal component (plus irregular component). Specifically, the moving average is subtracted from the observed series (for additive models) or the observed series is divided by the moving average values (for multiplicative models).

**Seasonal components. **The seasonal component is then computed as the average (for additive models) or medial average (for multiplicative models) for each point in the season.

(The medial average of a set of values is the mean after the smallest and largest values are excluded). The resulting values represent the (average) seasonal component of the series.

**Seasonally adjusted series. **The original series can be adjusted by subtracting from it (additive models) or dividing it by (multiplicative models) the seasonal component.

The resulting series is the seasonally adjusted series (i.e., the seasonal component will be removed).

**Trend-cycle component. **Remember that the cyclical component is different from the seasonal component in that it is usually longer than one season, and different cycles can be of different lengths. The combined trend and cyclical component can be approximated by applying to the seasonally adjusted series a 5 point (centered) weighed moving average smoothing transformation with the weights of 1, 2, 3, 2, 1.

**Random or irregular component. **Finally, the random or irregular (error) component can be isolated by subtracting from the seasonally adjusted series (additive models) or dividing the adjusted series by (multiplicative models) the trend-cycle component.

## X-11 Census Method II Seasonal Adjustment

The general ideas of seasonal decomposition and adjustment are discussed in the context of the Census I seasonal adjustment method (*Seasonal Decomposition (Census I)). *The Census method II (2) is an extension and refinement of the simple adjustment method. Over the years, different versions of the Census method II evolved at the Census Bureau; the method that has become most popular and is used most widely in government and business is the so-called X-11 variant of the Census method II (see Hiskin, Young, & Musgrave, 1967). Subsequently, the term X-11 has become synonymous with this refined version of the Census method II. In addition to the documentation that can be obtained from the Census Bureau, a detailed summary of this method is also provided in Makridakis, Wheelwright, and McGee (1983) and Makridakis and Wheelwright (1989).

For more information on this method, see the following topics:

- Seasonal Adjustment: Basic Ideas and Terms
- The Census II Method
- Results Tables Computed by the X-11 Method
- Specific Description of all Results Tables Computed by the X-11 Method

For more information on other Time Series methods, see Time Series Analysis – Index and the following topics:

- Identifying Patterns in Time Series Data
- ARIMA (Box & Jenkins) and Autocorrelations
- Interrupted Time series
- Exponential Smoothing
- Seasonal Decomposition (Census I)
- Distributed Lags Analysis
- Single Spectrum (Fourier) Analysis
- Cross-spectrum Analysis
- Basic Notations and Principles
- Fast Fourier Transformations

### Seasonal Adjustment: Basic Ideas and Terms

Suppose you recorded the monthly passenger load on international flights for a period of 12 years ( see Box & Jenkins, 1976). If you plot those data, it is apparent that (1) there appears to be an upwards linear trend in the passenger loads over the years, and (2) there is a recurring pattern or *seasonality* within each year (i.e., most travel occurs during the summer months, and a minor peak occurs during the December holidays). The purpose of seasonal decomposition and adjustment is to isolate those components, that is, to de-compose the series into the trend effect, seasonal effects, and remaining variability. The “classic” technique designed to accomplish this decomposition was developed in the 1920’s and is also known as the *Census I* method (see the Census I overview section). This technique is also described and discussed in detail in Makridakis, Wheelwright, and McGee (1983), and Makridakis and Wheelwright (1989).

**General model. **The general idea of seasonal decomposition is straightforward. In general, a time series like the one described above can be thought of as consisting of four different components: (1) A seasonal component (denoted as *S _{t}*, where

*t*stands for the particular point in time) (2) a trend component (

*T*), (3) a cyclical component (

_{t}*C*), and (4) a random, error, or irregular component (

_{t}*I*). The difference between a cyclical and a seasonal component is that the latter occurs at regular (seasonal) intervals, while cyclical factors usually have a longer duration that varies from cycle to cycle. The trend and cyclical components are customarily combined into a

_{t}*trend-cycle component*(

*TC*). The specific functional relationship between these components can assume different forms. However, two straightforward possibilities are that they combine in an

_{t}*additive*or a

*multiplicative*fashion:

Additive Model:

X_{t} = TC_{t} + S_{t} + I_{t}

Multiplicative Model:

X_{t} = T_{t}*C_{t}*S_{t}*I_{t}

Where:

*X _{t}* represents the observed value of the time series at time

*t*.

Given some *a priori* knowledge about the cyclical factors affecting the series (e.g., business cycles), the estimates for the different components can be used to compute forecasts for future observations. (However, the *Exponential smoothing* method, which can also incorporate seasonality and trend components, is the preferred technique for forecasting purposes.)

**Additive and multiplicative seasonality. **Consider the difference between an additive and multiplicative seasonal component in an example: The annual sales of toys will probably peak in the months of November and December, and perhaps during the summer (with a much smaller peak) when children are on their summer break. This seasonal pattern will likely repeat every year. Seasonal components can be additive or multiplicative in nature. For example, during the month of December the sales for a particular toy may increase by 3 million dollars every year. Thus, you could *add* to your forecasts for every December the amount of 3 million to account for this seasonal fluctuation. In this case, the seasonality is *additive*.

Alternatively, during the month of December the sales for a particular toy may increase by 40%, that is, increase by a *factor* of 1.4. Thus, when the sales for the toy are generally weak, then the absolute (dollar) increase in sales during December will be relatively weak (but the percentage will be constant); if the sales of the toy are strong, then the absolute (dollar) increase in sales will be proportionately greater. Again, in this case the sales increase by a certain *factor*, and the seasonal component is thus *multiplicative* in nature (i.e., the multiplicative seasonal component in this case would be 1.4). In plots of series, the distinguishing characteristic between these two types of seasonal components is that in the additive case, the series shows steady seasonal fluctuations, regardless of the overall level of the series; in the multiplicative case, the size of the seasonal fluctuations vary, depending on the overall level of the series.

**Additive and multiplicative trend-cycle. **The previous example can be extended to illustrate the additive and multiplicative trend-cycle components. In terms of the toy example, a “fashion” *trend* may produce a steady increase in sales (e.g., a trend towards more educational toys in general); as with the seasonal component, this trend may be additive (sales increase by 3 million dollars per year) or multiplicative (sales increase by 30%, or by a factor of 1.3, annually) in nature. In addition, cyclical components may impact sales. To reiterate, a cyclical component is different from a seasonal component in that it usually is of longer duration, and that it occurs at irregular intervals. For example, a particular toy may be particularly “hot” during a summer season (e.g., a particular doll which is tied to the release of a major children’s movie, and is promoted with extensive advertising). Again such a cyclical component can effect sales in an additive manner or multiplicative manner.

### The Census II Method

The basic method for seasonal decomposition and adjustment outlined in the Basic Ideas and Terms topic can be refined in several ways. In fact, unlike many other time-series modeling techniques (e.g., ARIMA) which are grounded in some theoretical model of an underlying process, the *X-11* variant of the Census II method simply contains many *ad hoc* features and refinements, that over the years have proven to provide excellent estimates for many real-world applications (see Burman, 1979, Kendal & Ord, 1990, Makridakis & Wheelwright, 1989; Wallis, 1974). Some of the major refinements are listed below.

**Trading-day adjustment. **Different months have different numbers of days, and different numbers of trading-days (i.e., Mondays, Tuesdays, etc.). When analyzing, for example, monthly revenue figures for an amusement park, the fluctuation in the different numbers of Saturdays and Sundays (peak days) in the different months will surely contribute significantly to the variability in monthly revenues. The *X-11* variant of the Census II method allows the user to test whether such trading-day variability exists in the series, and, if so, to adjust the series accordingly.

**Extreme values. **Most real-world time series contain outliers, that is, extreme fluctuations due to rare events. For example, a strike may affect production in a particular month of one year. Such extreme outliers may bias the estimates of the seasonal and trend components. The *X-11* procedure includes provisions to deal with extreme values through the use of “statistical control principles,” that is, values that are above or below a certain range (expressed in terms of multiples of *sigma*, the standard deviation) can be modified or dropped before final estimates for the seasonality are computed.

**Multiple refinements. **The refinement for outliers, extreme values, and different numbers of trading-days can be applied more than once, in order to obtain successively improved estimates of the components. The *X-11* method applies a series of successive refinements of the estimates to arrive at the final trend-cycle, seasonal, and irregular components, and the seasonally adjusted series.

**Tests and summary statistics. **In addition to estimating the major components of the series, various summary statistics can be computed. For example, analysis of variance tables can be prepared to test the significance of seasonal variability and trading-day variability (see above) in the series; the *X-11* procedure will also compute the percentage change from month to month in the random and trend-cycle components. As the duration or span in terms of months (or quarters for quarterly *X-11*) increases, the change in the trend-cycle component will likely also increase, while the change in the random component should remain about the same. The width of the average span at which the changes in the random component are about equal to the changes in the trend-cycle component is called the *month (quarter) for cyclical dominance*, or MCD (QCD) for short. For example, if the MCD is equal to 2, then you can infer that over a 2-month span the trend-cycle will dominate the fluctuations of the irregular (random) component. These and various other results are discussed in greater detail below.

### Result Tables Computed by the *X-11* Method

The computations performed by the *X-11* procedure are best discussed in the context of the results tables that are reported. The adjustment process is divided into seven major steps, which are customarily labeled with consecutive letters A through G.

**Prior adjustment (monthly seasonal adjustment only).**Before any seasonal adjustment is performed on the monthly time series, various prior user- defined adjustments can be incorporated. The user can specify a second series that contains prior adjustment factors; the values in that series will either be subtracted (additive model) from the original series, or the original series will be divided by these values (multiplicative model). For multiplicative models, user-specified trading-day adjustment weights can also be specified. These weights will be used to adjust the monthly observations depending on the number of respective trading-days represented by the observation.**Preliminary estimation of trading-day variation (monthly X-11) and weights.**Next, preliminary trading-day adjustment factors (monthly*X-11*only) and weights for reducing the effect of extreme observations are computed.**Final estimation of trading-day variation and irregular weights (monthly**The adjustments and weights computed in*X- 11*).*B*above are then used to derive improved trend-cycle and seasonal estimates. These improved estimates are used to compute the final trading-day factors (monthly*X-11*only) and weights.**Final estimation of seasonal factors, trend-cycle, irregular, and seasonally adjusted series.**The final trading-day factors and weights computed in*C*above are used to compute the final estimates of the components.**Modified original, seasonally adjusted, and irregular series.**The original and final seasonally adjusted series, and the irregular component are modified for extremes. The resulting modified series allow the user to examine the stability of the seasonal adjustment.**Month (quarter) for cyclical dominance (MCD, QCD), moving average, and summary measures.**In this part of the computations, various summary measures (see below) are computed to allow the user to examine the relative importance of the different components, the average fluctuation from month-to-month (quarter-to-quarter), the average number of consecutive changes in the same direction (average number of runs), etc.**Charts.**Finally, you will compute various charts (graphs) to summarize the results. For example, the final seasonally adjusted series will be plotted, in chronological order, or by month (see below).

### Specific Description of all Result Tables Computed by the *X-11* Method

In each part *A* through *G* of the analysis (see Results Tables Computed by the *X-11* Method), different result tables are computed. Customarily, these tables are numbered, and also identified by a letter to indicate the respective part of the analysis. For example, table *B 11* shows the initial seasonally adjusted series; *C 11* is the refined seasonally adjusted series, and *D 11* is the final seasonally adjusted series. Shown below is a list of all available tables. Those tables identified by an asterisk (*) are not available (applicable) when analyzing quarterly series. (Also, for quarterly adjustment, some of the computations outlined below are slightly different; for example instead of a 12-term [monthly] moving average, a 4-term [quarterly] moving average is applied to compute the seasonal factors; the initial trend-cycle estimate is computed via a centered 4-term moving average, the final trend-cycle estimate in each part is computed by a 5-term Henderson average.)

Following the convention of the Bureau of the Census version of the *X-11* method, three levels of printout detail are offered: *Standard* (17 to 27 tables), *Long* (27 to 39 tables), and *Full* (44 to 59 tables). In the description of each table below, the letters *S, L,* and *F* are used next to each title to indicate, which tables will be displayed and/or printed at the respective setting of the output option. (For the charts, two levels of detail are available: *Standard* and *All*.)

See the table name below, to obtain more information about that table.

## Distributed Lags Analysis

For more information on other Time Series methods, see Time Series Analysis – Index and the following topics:

- Identifying Patterns in Time Series Data
- ARIMA (Box & Jenkins) and Autocorrelations ARIMA Introductory Overview
- Interrupted Time Series
- Exponential Smoothing
- Seasonal Decomposition (Census I)
- X-11 Census method II seasonal adjustment
- X-11 Census method II result tables
- Single Spectrum (Fourier) Analysis
- Cross-spectrum Analysis
- Basic Notations and Principles
- Fast Fourier Transformations

### General Purpose

Distributed lags analysis is a specialized technique for examining the relationships between variables that involve some delay. For example, suppose that you are a manufacturer of computer software, and you want to determine the relationship between the number of inquiries that are received, and the number of orders that are placed by your customers. You could record those numbers monthly for a one-year period, and then correlate the two variables. However, obviously inquiries will precede actual orders, and you can expect that the number of orders will follow the number of inquiries with some delay. Put another way, there will be a (time) *lagged* correlation between the number of inquiries and the number of orders that are received.

Time-lagged correlations are particularly common in econometrics. For example, the benefits of investments in new machinery usually only become evident after some time. Higher income will change people’s choice of rental apartments, however, this relationship will be lagged because it will take some time for people to terminate their current leases, find new apartments, and move. In general, the relationship between capital appropriations and capital expenditures will be lagged, because it will require some time before investment decisions are actually acted upon.

In all of these cases, we have an independent or *explanatory* variable that affects the *dependent* variables with some lag. The distributed lags method allows you to investigate those lags.

Detailed discussions of distributed lags correlation can be found in most econometrics textbooks, for example, in Judge, Griffith, Hill, Luetkepohl, and Lee (1985), Maddala (1977), and Fomby, Hill, and Johnson (1984). In the following paragraphs we will present a brief description of these methods. We will assume that you are familiar with the concept of correlation (see *Basic Statistics*), and the basic ideas of multiple regression (see *Multiple Regression*).

### General Model

Suppose we have a dependent variable *y* and an independent or explanatory variable *x* which are both measured repeatedly over time. In some textbooks, the dependent variable is also referred to as the *endogenous* variable, and the independent or explanatory variable the *exogenous* variable. The simplest way to describe the relationship between the two would be in a simple linear relationship:

Y_{t} = _{i}*x_{t-i}

In this equation, the value of the dependent variable at time *t* is expressed as a linear function of *x* measured at times *t, t-1, t-2, *etc. Thus, the dependent variable is a linear function of x, and x is lagged by 1, 2, etc. time periods. The beta weights (_{i}) can be considered slope parameters in this equation. You may recognize this equation as a special case of the general linear regression equation (see the* **Multiple Regression* overview). If the weights for the lagged time periods are statistically significant, we can conclude that the y variable is predicted (or explained) with the respective lag.

### Almon Distributed Lag

A common problem that often arises when computing the weights for the multiple linear regression model shown above is that the values of adjacent (in time) values in the *x* variable are highly correlated. In extreme cases, their independent contributions to the prediction of y may become so redundant that the correlation matrix of measures can no longer be inverted, and thus, the *beta* weights cannot be computed. In less extreme cases, the computation of the *beta* weights and their standard errors can become very imprecise, due to round-off error. In the context of *Multiple Regression* this general computational problem is discussed as the *multicollinearity* or *matrix ill-conditioning* issue.

Almon (1965) proposed a procedure that will reduce the multicollinearity in this case. Specifically, suppose we express each weight in the linear regression equation in the following manner:

_{i} = _{0} + _{1}*i + … + _{q}*i^{q}

Almon could show that in many cases it is easier (i.e., it avoids the multicollinearity problem) to estimate the *alpha* values than the *beta* weights directly. Note that with this method, the precision of the beta weight estimates is dependent on the degree or order of the *polynomial approximation*.

**Misspecifications. **A general problem with this technique is that, of course, the lag length and correct polynomial degree are not known *a priori*. The effects of misspecifications of these parameters are potentially serious (in terms of biased estimation). This issue is discussed in greater detail in Frost (1975), Schmidt and Waud (1973), Schmidt and Sickles (1975), and Trivedi and Pagan (1979).

## Single Spectrum (Fourier) Analysis

Spectrum analysis is concerned with the exploration of cyclical patterns of data. The purpose of the analysis is to decompose a complex time series with cyclical components into a few underlying sinusoidal (sine and cosine) functions of particular wavelengths. The term “spectrum” provides an appropriate metaphor for the nature of this analysis: Suppose you study a beam of white sun light, which at first looks like a random (white noise) accumulation of light of different wavelengths. However, when put through a prism, we can separate the different wave lengths or cyclical components that make up white sun light. In fact, via this technique we can now identify and distinguish between different sources of light. Thus, by identifying the important underlying cyclical components, we have learned something about the phenomenon of interest. In essence, performing spectrum analysis on a time series is like putting the series through a prism in order to identify the wave lengths and importance of underlying cyclical components. As a result of a successful analysis, you might uncover just a few recurring cycles of different lengths in the time series of interest, which at first looked more or less like random noise.

A much cited example for spectrum analysis is the cyclical nature of sun spot activity (e.g., see Bloomfield, 1976, or Shumway, 1988). It turns out that sun spot activity varies over 11 year cycles. Other examples of celestial phenomena, weather patterns, fluctuations in commodity prices, economic activity, etc. are also often used in the literature to demonstrate this technique. To contrast this technique with ARIMA or Exponential Smoothing, the purpose of spectrum analysis is to identify the seasonal fluctuations of different lengths, while in the former types of analysis, the length of the seasonal component is usually known (or guessed) *a priori* and then included in some theoretical model of moving averages or autocorrelations.

The classic text on spectrum analysis is Bloomfield (1976); however, other detailed discussions can be found in Jenkins and Watts (1968), Brillinger (1975), Brigham (1974), Elliott and Rao (1982), Priestley (1981), Shumway (1988), or Wei (1989).

For more information, see Time Series Analysis – Index and the following topics:

- Basic Notations and Principles
- Fast Fourier Transformations
- Identifying Patterns in Time Series Data
- ARIMA (Box & Jenkins) and Autocorrelations ARIMA Introductory Overview
- Interrupted Time Series
- Distributed Lags Analysis
- Seasonal Decomposition (Census I)
- Exponential Smoothing
- Cross-spectrum Analysis

## Cross-Spectrum Analysis

- General Introduction
- Basic Notation and Principles
- Results for Each Variable
- The Cross-periodogram, Cross-density, Quadrature-density, and Cross-amplitude
- Squared Coherency, Gain, and Phase Shift
- How the Example Data were Created

For more information, see Time Series Analysis – Index and the following topics:

- Identifying Patterns in Time Series Data
- ARIMA (Box & Jenkins) and Autocorrelations ARIMA Introductory Overview
- Interrupted Time Series
- Exponential Smoothing Seasonal Decomposition (Census I)
- X-11 Census method II seasonal adjustment
- X-11 Census method II result tables
- Distributed Lags analysis
- Single Spectrum (Fourier) Analysis
- Basic Notations and Principles
- Fast Fourier Transformations

### General Introduction

Cross-spectrum analysis is an extension of *Single Spectrum (Fourier) Analysis* to the simultaneous analysis of two series. In the following paragraphs, we will assume that you have already read the introduction to single spectrum analysis. Detailed discussions of this technique can be found in Bloomfield (1976), Jenkins and Watts (1968), Brillinger (1975), Brigham (1974), Elliott and Rao (1982), Priestley (1981), Shumway (1988), or Wei (1989).

**Strong periodicity in the series at the respective frequency. **A much cited example for spectrum analysis is the cyclical nature of sun spot activity (e.g., see Bloomfield, 1976, or Shumway, 1988). It turns out that sun spot activity varies over 11 year cycles. Other examples of celestial phenomena, weather patterns, fluctuations in commodity prices, economic activity, etc. are also often used in the literature to demonstrate this technique.

The purpose of cross-spectrum analysis is to uncover the correlations between two series at different frequencies. For example, sun spot activity may be related to weather phenomena here on earth. If so, then if we were to record those phenomena (e.g., yearly average temperature) and submit the resulting series to a cross-spectrum analysis together with the sun spot data, we may find that the weather indeed correlates with the sunspot activity at the 11 year cycle. That is, we may find a periodicity in the weather data that is “in-sync” with the sun spot cycles. We can easily think of other areas of research where such knowledge could be very useful; for example, various economic indicators may show similar (correlated) cyclical behavior; various physiological measures likely will also display “coordinated” (i.e., correlated) cyclical behavior, and so on.

### Basic Notation and Principles

**A simple example**

Consider the following two series with 16 cases:

VAR1 | VAR2 | |
---|---|---|

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 |
1.000 1.637 1.148 -.058 -.713 -.383 .006 -.483 -1.441 -1.637 -.707 .331 .441 -.058 -.006 .924 |
-.058 -.713 -.383 .006 -.483 -1.441 -1.637 -.707 .331 .441 -.058 -.006 .924 1.713 1.365 .266 |

At first sight it is not easy to see the relationship between the two series. However, as shown below the series were created so that they would contain two strong correlated periodicities. Shown below are parts of the summary from the cross-spectrum analysis (the spectral estimates were smoothed with a Parzen window of width 3).

Indep.(X): VAR1 Dep.(Y): VAR2 |
||||||
---|---|---|---|---|---|---|

Frequncy |
Period |
X Density |
Y Density |
Cross Density |
Cross Quad |
Cross Amplit. |

0.000000 .062500 .125000 .187500 .250000 .312500 .375000 .437500 .500000 |
16.00000 8.00000 5.33333 4.00000 3.20000 2.66667 2.28571 2.00000 |
.000000 8.094709 .058771 3.617294 .333005 .091897 .052575 .040248 .037115 |
.024292 7.798284 .100936 3.845154 .278685 .067630 .036056 .026633 0.000000 |
-.00000 2.35583 -.04755 -2.92645 -.26941 -.07435 -.04253 -.03256 0.00000 |
0.00000 -7.58781 .06059 2.31191 .14221 .02622 .00930 .00342 0.00000 |
.000000 7.945114 .077020 3.729484 .304637 .078835 .043539 .032740 0.000000 |

### Results for Each Variable

The complete summary contains all spectrum statistics computed for each variable, as described in the *Single Spectrum (Fourier) Analysis* overview section. Looking at the results shown above, it is clear that both variables show strong periodicities at the frequencies .0625 and .1875.

### Cross-Periodogram, Cross-Density, Quadrature-Density, Cross-Amplitude

Analogous to the results for the single variables, the complete summary will also display periodogram values for the cross periodogram. However, the cross-spectrum consists of complex numbers that can be divided into a real and an imaginary part. These can be smoothed to obtain the cross-density and quadrature density (quad density for short) estimates, respectively. (The reasons for smoothing, and the different common weight functions for smoothing are discussed in the *Single Spectrum (Fourier) Analysis*.) The square root of the sum of the squared cross-density and quad-density values is called the *cross- amplitude*. The cross-amplitude can be interpreted as a measure of covariance between the respective frequency components in the two series. Thus we can conclude from the results shown in the table above that the .0625 and .1875 frequency components in the two series covary.

### Squared Coherency, Gain, and Phase Shift

There are additional statistics that can be displayed in the complete summary.

**Squared coherency. **You can standardize the cross-amplitude values by squaring them and dividing by the product of the spectrum density estimates for each series. The result is called the *squared coherency*, which can be interpreted similar to the squared correlation coefficient (see *Correlations – Overview*), that is, the coherency value is the squared correlation between the cyclical components in the two series at the respective frequency. However, the coherency values should not be interpreted by themselves; for example, when the spectral density estimates in both series are very small, large coherency values may result (the divisor in the computation of the coherency values will be very small), even though there are no strong cyclical components in either series at the respective frequencies.

**Gain. **The gain value is computed by dividing the cross-amplitude value by the spectrum density estimates for one of the two series in the analysis. Consequently, two gain values are computed, which can be interpreted as the standard least squares regression coefficients for the respective frequencies.

**Phase shift. **Finally, the phase shift estimates are computed as tan**-1 of the ratio of the quad density estimates over the cross-density estimate. The phase shift estimates (usually denoted by the Greek letter ) are measures of the extent to which each frequency component of one series leads the other.

### How the Example Data were Created

Now, let’s return to the example data set presented above. The large spectral density estimates for both series, and the cross-amplitude values at frequencies = 0.0625 and = .1875 suggest two strong synchronized periodicities in both series at those frequencies. In fact, the two series were created as:

v1 = cos(2**.0625*(v0-1)) + .75*sin(2**.2*(v0-1))

v2 = cos(2**.0625*(v0+2)) + .75*sin(2**.2*(v0+2))

(where *v0* is the case number). Indeed, the analysis presented in this overview reproduced the periodicity “inserted” into the data very well.

## Spectrum Analysis – Basic Notation and Principles

- Frequency and Period
- The General Structural Model
- A Simple Example
- Periodogram
- The Problem of Leakage
- Padding the Time Series
- Tapering
- Data Windows and Spectral Density Estimates
- Preparing the Data for Analysis
- Results when no Periodicity in the Series Exists

For more information, see Time Series Analysis – Index and the following topics:

- Identifying Patterns in Time Series Data
- ARIMA (Box & Jenkins) and Autocorrelations ARIMA Introductory Overview
- Interrupted Time Series
- Exponential Smoothing
- Seasonal Decomposition (Census I)
- X-11 Census method II seasonal adjustment
- X-11 Census method II result tables
- Distributed Lags Analysis
- Single Spectrum (Fourier) Analysis
- Cross-spectrum Analysis
- Fast Fourier Transformations

### Frequency and Period

The “wave length” of a sine or cosine function is typically expressed in terms of the number of cycles per unit time (*Frequency*), often denoted by the Greek letter *nu* ( ; some textbooks also use *f*). For example, the number of letters handled in a post office may show 12 cycles per year: On the first of every month a large amount of mail is sent (many bills come due on the first of the month), then the amount of mail decreases in the middle of the month, then it increases again towards the end of the month. Therefore, every month the fluctuation in the amount of mail handled by the post office will go through a full cycle. Thus, if the unit of analysis is one year, then n would be equal to 12, as there would be 12 cycles per year. Of course, there will likely be other cycles with different frequencies. For example, there might be annual cycles ( =1), and perhaps weekly cycles <(=52 weeks per year).

The *period T* of a sine or cosine function is defined as the length of time required for one full cycle. Thus, it is the reciprocal of the frequency, or: *T* = 1/. To return to the mail example in the previous paragraph, the monthly cycle, expressed in yearly terms, would be equal to 1/12 = 0.0833. Put into words, there is a period in the series of length 0.0833 years.

### The General Structural Model

As mentioned before, the purpose of spectrum analysis is to decompose the original series into underlying sine and cosine functions of different frequencies, in order to determine those that appear particularly strong or important. One way to do so would be to cast the issue as a linear *Multiple Regression* problem, where the dependent variable is the observed time series, and the independent variables are the sine functions of all possible (discrete) frequencies. Such a linear multiple regression model can be written as:

x_{t} = a_{0} + [a_{k}*cos(_{k}*t) + b_{k}*sin(_{k}*t)] (for k = 1 to q)

Following the common notation from classical harmonic analysis, in this equation (lambda) is the frequency expressed in terms of radians per unit time, that is: = 2**_{k}, where is the constant *pi*=3.14… and _{k} = k/q. What is important here is to recognize that the computational problem of fitting sine and cosine functions of different lengths to the data can be considered in terms of multiple linear regression. Note that the cosine parameters *a _{k}* and sine parameters

*b*are regression coefficients that tell us the degree to which the respective functions are correlated with the data. Overall there are

_{k}*q*different sine and cosine functions; intuitively (as also discussed in

*Multiple Regression*), it should be clear that we cannot have more sine and cosine functions than there are data points in the series. Without going into detail, if there are

*N*data points in the series, then there will be

*N/2+1*cosine functions and

*N/2-1*sine functions. In other words, there will be as many different sinusoidal waves as there are data points, and we will be able to completely reproduce the series from the underlying functions. (Note that if the number of cases in the series is odd, then the last data point will usually be ignored; in order for a sinusoidal function to be identified, you need at least two points: the high peak and the low peak.)

To summarize, spectrum analysis will identify the correlation of sine and cosine functions of different frequency with the observed data. If a large correlation (sine or cosine coefficient) is identified, you can conclude that there is a strong periodicity of the respective frequency (or period) in the data.

**Complex numbers (real and imaginary numbers). **In many textbooks on spectrum analysis, the structural model shown above is presented in terms of complex numbers, that is, the parameter estimation process is described in terms of the Fourier transform of a series into real and imaginary parts. Complex numbers are the superset that includes all real and imaginary numbers. Imaginary numbers, by definition, are numbers that are multiplied by the constant *i*, where *i* is defined as the square root of -1. Obviously, the square root of -1 does not exist, hence the term *imaginary* number; however, meaningful arithmetic operations on imaginary numbers can still be performed (e.g., [i*2]**2= -4). It is useful to think of real and imaginary numbers as forming a two dimensional plane, where the horizontal or *X*-axis represents all real numbers, and the vertical or *Y*-axis represents all imaginary numbers. Complex numbers can then be represented as points in the two- dimensional plane. For example, the complex number 3+i*2 can be represented by a point with coordinates {3,2} in this plane. You can also think of complex numbers as angles, for example, you can connect the point representing a complex number in the plane with the origin (complex number 0+i*0), and measure the angle of that vector to the horizontal line. Thus, intuitively you can see how the spectrum decomposition formula shown above, consisting of sine and cosine functions, can be rewritten in terms of operations on complex numbers. In fact, in this manner the mathematical discussion and required computations are often more elegant and easier to perform; which is why many textbooks prefer the presentation of spectrum analysis in terms of complex numbers.

### A Simple Example

Shumway (1988) presents a simple example to clarify the underlying “mechanics” of spectrum analysis. Let’s create a series with 16 cases following the equation shown above, and then see how we may “extract” the information that was put in it. First, create a variable and define it as:

x = 1*cos(2**.0625*(v0-1)) + .75*sin(2**.2*(v0-1))

This variable is made up of two underlying periodicities: The first at the frequency of =.0625 (or period 1/=16; one observation completes 1/16’th of a full cycle, and a full cycle is completed every 16 observations) and the second at the frequency of =.2 (or period of 5). The cosine coefficient (1.0) is larger than the sine coefficient (.75). The spectrum analysis summary is shown below.

Spectral analysis:VAR1 (shumex.sta) No. of cases: 16 |
|||||
---|---|---|---|---|---|

t |
Freq- uency |
Period |
Cosine Coeffs |
Sine Coeffs |
Period- ogram |

0 1 2 3 4 5 6 7 8 |
.0000 .0625 .1250 .1875 .2500 .3125 .3750 .4375 .5000 |
16.00 8.00 5.33 4.00 3.20 2.67 2.29 2.00 |
.000 1.006 .033 .374 -.144 -.089 -.075 -.070 -.068 |
0.000 .028 .079 .559 -.144 -.060 -.031 -.014 0.000 |
.000 8.095 .059 3.617 .333 .092 .053 .040 .037 |

Let’s now review the columns. Clearly, the largest cosine coefficient can be found for the .0625 frequency. A smaller sine coefficient can be found at frequency = .1875. Thus, clearly the two sine/cosine frequencies which were “inserted” into the example data file are reflected in the above table.

### Periodogram

The sine and cosine functions are mutually independent (or orthogonal); thus we may sum the squared coefficients for each frequency to obtain the *periodogram*. Specifically, the periodogram values above are computed as:

P_{k} = sine coefficient_{k}^{2} + cosine coefficient_{k}^{2} * N/2

where *P _{k}* is the periodogram value at frequency

_{k}and

*N*is the overall length of the series. The periodogram values can be interpreted in terms of variance (sums of squares) of the data at the respective frequency or period. Customarily, the periodogram values are plotted against the frequencies or periods.

### The Problem of Leakage

In the example above, a sine function with a frequency of 0.2 was “inserted” into the series. However, because of the length of the series (16), none of the frequencies reported exactly “hits” on that frequency. In practice, what often happens in those cases is that the respective frequency will “leak” into adjacent frequencies. For example, you may find large periodogram values for two adjacent frequencies, when, in fact, there is only one strong underlying sine or cosine function at a frequency that falls in-between those implied by the length of the series. There are three ways in which we can approach the problem of leakage:

- By padding the series, we may apply a finer frequency “roster” to the data,
- By
*tapering*the series prior to the analysis, we may reduce leakage, or - By smoothing the periodogram, we may identify the general frequency “regions” or (
*spectral densities*) that significantly contribute to the cyclical behavior of the series.

See below for descriptions of each of these approaches.

### Padding the Time Series

Because the frequency values are computed as *N/t* (the number of units of times), we can simply *pad* the series with a constant (e.g., zeros) and thereby introduce smaller increments in the frequency values. In a sense, padding allows us to apply a finer roster to the data. In fact, if we padded the example data file described in the example above with ten zeros, the results would not change, that is, the largest periodogram peaks would still occur at the frequency values closest to .0625 and .2. (Padding is also often desirable for computational efficiency reasons; see below.)

### Tapering

The so-called process of *split-cosine-bell tapering* is a recommended transformation of the series prior to the spectrum analysis. It usually leads to a reduction of leakage in the periodogram. The rationale for this transformation is explained in detail in Bloomfield (1976, p. 80-94). In essence, a proportion (*p*) of the data at the beginning and at the end of the series is transformed via multiplication by the weights:

w_{t} = 0.5*{1-cos[*(t – 0.5)/m]} (for t=0 to m-1)

w_{t} = 0.5*{1-cos[*(N – t + 0.5)/m]} (for t=N-m to N-1)

where *m* is chosen so that 2**m/N* is equal to the proportion of data to be tapered (*p*).

### Data Windows and Spectral Density Estimates

In practice, when analyzing actual data, it is usually not of crucial importance to identify exactly the frequencies for particular underlying sine or cosine functions. Rather, because the periodogram values are subject to substantial random fluctuation, we are faced with the problem of very many “chaotic” periodogram spikes. In that case, we want to find the frequencies with the greatest *spectral densities*, that is, the frequency regions, consisting of many adjacent frequencies, that contribute most to the overall periodic behavior of the series. This can be accomplished by smoothing the periodogram values via a weighted moving average transformation. Suppose the moving average window is of width *m* (which must be an odd number); the following are the most commonly used smoothers (note: *p = (m-1)/2*).

**Daniell (or equal weight) window. **The Daniell window (Daniell 1946) amounts to a simple (equal weight) moving average transformation of the periodogram values, that is, each spectral density estimate is computed as the mean of the *m/2* preceding and subsequent periodogram values.

**Tukey window. **In the Tukey (Blackman and Tukey, 1958) or Tukey-Hanning window (named after Julius Von Hann), for each frequency, the weights for the weighted moving average of the periodogram values are computed as:

w_{j} = 0.5 + 0.5*cos(*j/p) (for j=0 to p)

w_{-j} = w_{j} (for j 0)

**Hamming window. **In the Hamming (named after R. W. Hamming) window or Tukey-Hamming window (Blackman and Tukey, 1958), for each frequency, the weights for the weighted moving average of the periodogram values are computed as:

w_{j} = 0.54 + 0.46*cos(*j/p) (for j=0 to p)

w_{-j} = w_{j} (for j 0)

**Parzen window. **In the Parzen window (Parzen, 1961), for each frequency, the weights for the weighted moving average of the periodogram values are computed as:

w_{j} = 1-6*(j/p)^{2} + 6*(j/p)^{3} (for j = 0 to p/2)

w_{j} = 2*(1-j/p)^{3} (for j = p/2 + 1 to p)

w_{-j} = w_{j} (for j 0)

**Bartlett window. **In the Bartlett window (Bartlett, 1950) the weights are computed as:

w_{j} = 1-(j/p) (for j = 0 to p)

w_{-j} = w_{j} (for j 0)

With the exception of the Daniell window, all weight functions will assign the greatest weight to the observation being smoothed in the center of the window, and increasingly smaller weights to values that are further away from the center. In many cases, all of these data windows will produce very similar results.

### Preparing the Data for Analysis

Let’s now consider a few other practical points in spectrum analysis. Usually, we want to subtract the mean from the series, and detrend the series (so that it is stationary) prior to the analysis. Otherwise the periodogram and density spectrum will mostly be “overwhelmed” by a very large value for the first cosine coefficient (for frequency 0.0). In a sense, the mean is a cycle of frequency 0 (zero) per unit time; that is, it is a constant. Similarly, a trend is also of little interest when we want to uncover the periodicities in the series. In fact, both of those potentially strong effects may mask the more interesting periodicities in the data, and thus both the mean and the trend (linear) should be removed from the series prior to the analysis. Sometimes, it is also useful to smooth the data prior to the analysis, in order to “tame” the random noise that may obscure meaningful periodic cycles in the periodogram.

### Results when No Periodicity in the Series Exists

Finally, what if there are no recurring cycles in the data, that is, if each observation is completely independent of all other observations? If the distribution of the observations follows the normal distribution, such a time series is also referred to as a *white noise* series (like the white noise you hear on the radio when tuned in-between stations). A white noise input series will result in periodogram values that follow an exponential distribution. Thus, by testing the distribution of periodogram values against the exponential distribution, you can test whether the input series is different from a white noise series. In addition, then you can also request to compute the Kolmogorov-Smirnov one-sample *d* statistic (see also *Nonparametrics and Distributions* for more details).

**Testing for white noise in certain frequency bands. **Note that you can also plot the periodogram values for a particular frequency range only. Again, if the input is a white noise series with respect to those frequencies (i.e., it there are no significant periodic cycles of those frequencies), then the distribution of the periodogram values should again follow an exponential distribution.

## Fast Fourier Transforms (FFT)

For more information, see Time Series Analysis – Index and the following topics:

- Identifying Patterns in Time Series Data
- ARIMA (Box & Jenkins) and Autocorrelations ARIMA Introductory Overview
- Interrupted Time Series
- Exponential Smoothing
- Seasonal Decomposition (Census I)
- X-11 Census method II seasonal adjustment
- X-11 Census method II result tables
- Distributed Lags Analysis
- Single Spectrum (Fourier) Analysis
- Cross-spectrum Analysis
- Basic Notations and Principles

### General Introduction

The interpretation of the results of spectrum analysis is discussed in the *Basic Notation and Principles* topic, however, we have not described how it is done computationally. Up until the mid-1960s the standard way of performing the spectrum decomposition was to use explicit formulae to solve for the sine and cosine parameters. The computations involved required at least N**2 (complex) multiplications. Thus, even with today’s high-speed computers , it would be very time consuming to analyze even small time series (e.g., 8,000 observations would result in at least 64 million multiplications).

The time requirements changed drastically with the development of the so-called *fast Fourier transform* algorithm, or *FFT* for short. In the mid-1960s, J.W. Cooley and J.W. Tukey (1965) popularized this algorithm which, in retrospect, had in fact been discovered independently by various individuals. Various refinements and improvements of this algorithm can be found in Monro (1975) and Monro and Branch (1976). Readers interested in the computational details of this algorithm may refer to any of the texts cited in the overview. Suffice it to say that via the FFT algorithm, the time to perform a spectral analysis is proportional to *N**log2(*N*) – a huge improvement.

However, a draw-back of the standard FFT algorithm is that the number of cases in the series must be equal to a power of 2 (i.e., 16, 64, 128, 256, …). Usually, this necessitated padding of the series, which, as described above, will in most cases not change the characteristic peaks of the periodogram or the spectral density estimates. In cases, however, where the time units are meaningful, such padding may make the interpretation of results more cumbersome.

### Computation of FFT in Time Series

The implementation of the FFT algorithm allows you to take full advantage of the savings afforded by this algorithm. On most standard computers, series with over 100,000 cases can easily be analyzed. However, there are a few things to remember when analyzing series of that size.

As mentioned above, the standard (and most efficient) FFT algorithm requires that the length of the input series is equal to a power of 2. If this is not the case, additional computations have to be performed. It will use the simple explicit computational formulas as long as the input series is relatively small, and the number of computations can be performed in a relatively short amount of time. For long time series, in order to still utilize the FFT algorithm, an implementation of the general approach described by Monro and Branch (1976) is used. This method requires significantly more storage space, however, series of considerable length can still be analyzed very quickly, even if the number of observations is not equal to a power of 2.

For time series of lengths not equal to a power of 2, we would like to make the following recommendations: If the input series is small to moderately sized (e.g., only a few thousand cases), then do not worry. The analysis will typically only take a few seconds anyway. In order to analyze moderately large and large series (e.g., over 100,000 cases), pad the series to a power of 2 and then taper the series during the exploratory part of your data analysis.

## Text Mining (Big Data, Unstructured Data)

## Text Mining Introductory Overview

The purpose of Text Mining is to process unstructured (textual) information, extract meaningful numeric indices from the text, and, thus, make the information contained in the text accessible to the various data mining (statistical and machine learning) algorithms. Information can be extracted to derive summaries for the words contained in the documents or to compute summaries for the documents based on the words contained in them. Hence, you can analyze words, clusters of words used in documents, etc., or you could analyze documents and determine similarities between them or how they are related to other variables of interest in the data mining project. In the most general terms, text mining will “turn text into numbers” (meaningful indices), which can then be incorporated in other analyses such as predictive data mining projects, the application of unsupervised learning methods (clustering), etc. These methods are described and discussed in great detail in the comprehensive overview work by Manning and Schütze (2002), and for an in-depth treatment of these and related topics as well as the history of this approach to text mining, we highly recommend that source.

## Typical Applications for Text Mining

Unstructured text is very common, and in fact may represent the majority of information available to a particular research or data mining project.

**Analyzing open-ended survey responses.** In survey research (e.g., marketing), it is not uncommon to include various open-ended questions pertaining to the topic under investigation. The idea is to permit respondents to express their “views” or opinions without constraining them to particular dimensions or a particular response format. This may yield insights into customers’ views and opinions that might otherwise not be discovered when relying solely on structured questionnaires designed by “experts.” For example, you may discover a certain set of words or terms that are commonly used by respondents to describe the pro’s and con’s of a product or service (under investigation), suggesting common misconceptions or confusion regarding the items in the study.

**Automatic processing of messages, emails, etc.** Another common application for text mining is to aid in the automatic classification of texts. For example, it is possible to “filter” out automatically most undesirable “junk email” based on certain terms or words that are not likely to appear in legitimate messages, but instead identify undesirable electronic mail. In this manner, such messages can automatically be discarded. Such automatic systems for classifying electronic messages can also be useful in applications where messages need to be routed (automatically) to the most appropriate department or agency; e.g., email messages with complaints or petitions to a municipal authority are automatically routed to the appropriate departments; at the same time, the emails are screened for inappropriate or obscene messages, which are automatically returned to the sender with a request to remove the offending words or content.

**Analyzing warranty or insurance claims, diagnostic interviews, etc.** In some business domains, the majority of information is collected in open-ended, textual form. For example, warranty claims or initial medical (patient) interviews can be summarized in brief narratives, or when you take your automobile to a service station for repairs, typically, the attendant will write some notes about the problems that you report and what you believe needs to be fixed. Increasingly, those notes are collected electronically, so those types of narratives are readily available for input into text mining algorithms. This information can then be usefully exploited to, for example, identify common clusters of problems and complaints on certain automobiles, etc. Likewise, in the medical field, open-ended descriptions by patients of their own symptoms might yield useful clues for the actual medical diagnosis.

**Investigating competitors by crawling their web sites.** Another type of potentially very useful application is to automatically process the contents of Web pages in a particular domain. For example, you could go to a Web page, and begin “crawling” the links you find there to process all Web pages that are referenced. In this manner, you could automatically derive a list of terms and documents available at that site, and hence quickly determine the most important terms and features that are described. It is easy to see how these capabilities could efficiently deliver valuable business intelligence about the activities of competitors.

## Approaches to Text Mining

To reiterate, text mining can be summarized as a process of “numericizing” text. At the simplest level, all words found in the input documents will be indexed and counted in order to compute a table of documents and words, i.e., a matrix of frequencies that enumerates the number of times that each word occurs in each document. This basic process can be further refined to exclude certain common words such as “the” and “a” (stop word lists) and to combine different grammatical forms of the same words such as “traveling,” “traveled,” “travel,” etc. (stemming). However, once a table of (unique) words (terms) by documents has been derived, all standard statistical and data mining techniques can be applied to derive dimensions or clusters of words or documents, or to identify “important” words or terms that best predict another outcome variable of interest.

**Using well-tested methods and understanding the results of text mining.** Once a data matrix has been computed from the input documents and words found in those documents, various well-known analytic techniques can be used for further processing those data including methods for clustering, factoring, or predictive data mining (see, for example, Manning and Schütze, 2002).

**“Black-box” approaches to text mining and extraction of concepts.** There are text mining applications which offer “black-box” methods to extract “deep meaning” from documents with little human effort (to first read and understand those documents). These text mining applications rely on proprietary algorithms for presumably extracting “concepts” from text, and may even claim to be able to summarize large numbers of text documents automatically, retaining the core and most important meaning of those documents. While there are numerous algorithmic approaches to extracting “meaning from documents,” this type of technology is very much still in its infancy, and the aspiration to provide meaningful automated summaries of large numbers of documents may forever remain elusive. We urge skepticism when using such algorithms because 1) if it is not clear to the user how those algorithms work, it cannot possibly be clear how to interpret the results of those algorithms, and 2) the methods used in those programs are not open to scrutiny, for example by the academic community and peer review and, hence, we simply don’t know how well they might perform in different domains. As a final thought on this subject, you may consider this concrete example: Try the various automated translation services available via the Web that can translate entire paragraphs of text from one language into another. Then translate some text, even simple text, from your native language to some other language and back, and review the results. Almost every time, the attempt to translate even short sentences to other languages and back while retaining the original meaning of the sentence produces humorous rather than accurate results. This illustrates the difficulty of automatically interpreting the meaning of text.

**Text mining as document search.** There is another type of application that is often described and referred to as “text mining” – the automatic search of large numbers of documents based on key words or key phrases. This is the domain of, for example, the popular internet search engines that have been developed over the last decade to provide efficient access to Web pages with certain content. While this is obviously an important type of application with many uses in any organization that needs to search very large document repositories based on varying criteria, it is very different from what has been described here.

## Issues and Considerations for “Numericizing” Text

**Large numbers of small documents vs. small numbers of large documents.** Examples of scenarios using large numbers of small or moderate sized documents were given earlier (e.g., analyzing warranty or insurance claims, diagnostic interviews, etc.). On the other hand, if your intent is to extract “concepts” from only a few documents that are very large (e.g., two lengthy books), then statistical analyses are generally less powerful because the “number of cases” (documents) in this case is very small while the “number of variables” (extracted words) is very large.

**Excluding certain characters, short words, numbers, etc.** Excluding numbers, certain characters, or sequences of characters, or words that are shorter or longer than a certain number of letters can be done before the indexing of the input documents starts. You may also want to exclude “rare words,” defined as those that only occur in a small percentage of the processed documents.

**Include lists, exclude lists (stop-words).** Specific list of words to be indexed can be defined; this is useful when you want to search explicitly for particular words, and classify the input documents based on the frequencies with which those words occur. Also, “stop-words,” i.e., terms that are to be excluded from the indexing can be defined. Typically, a default list of English stop words includes “the”, “a”, “of”, “since,” etc, i.e., words that are used in the respective language very frequently, but communicate very little unique information about the contents of the document.

**Synonyms and phrases.** Synonyms, such as “sick” or “ill”, or words that are used in particular phrases where they denote unique meaning can be combined for indexing. For example, “Microsoft Windows” might be such a phrase, which is a specific reference to the computer operating system, but has nothing to do with the common use of the term “Windows” as it might, for example, be used in descriptions of home improvement projects.

**Stemming algorithms.** An important pre-processing step before indexing of input documents begins is the stemming of words. The term “stemming” refers to the reduction of words to their roots so that, for example, different grammatical forms or declinations of verbs are identified and indexed (counted) as the same word. For example, stemming will ensure that both “traveling” and “traveled” will be recognized by the text mining program as the same word.

**Support for different languages.** Stemming, synonyms, the letters that are permitted in words, etc. are highly language dependent operations. Therefore, support for different languages is important.

## Transforming Word Frequencies

Once the input documents have been indexed and the initial word frequencies (by document) computed, a number of additional transformations can be performed to summarize and aggregate the information that was extracted.

**Log-frequencies.** First, various transformations of the frequency counts can be performed. The raw word or term frequencies generally reflect on how salient or important a word is in each document. Specifically, words that occur with greater frequency in a document are better descriptors of the contents of that document. However, it is not reasonable to assume that the word counts themselves are proportional to their importance as descriptors of the documents. For example, if a word occurs 1 time in document A, but 3 times in document B, then it is not necessarily reasonable to conclude that this word is 3 times as important a descriptor of document B as compared to document A. Thus, a common transformation of the raw word frequency counts (wf) is to compute:

f(wf) = 1+ log(wf), for wf > 0

This transformation will “dampen” the raw frequencies and how they will affect the results of subsequent computations.

**Binary frequencies.** Likewise, an even simpler transformation can be used that enumerates whether a term is used in a document; i.e.:

f(wf) = 1, for wf > 0

The resulting documents-by-words matrix will contain only 1s and 0s to indicate the presence or absence of the respective words. Again, this transformation will dampen the effect of the raw frequency counts on subsequent computations and analyses.

**Inverse document frequencies.** Another issue that you may want to consider more carefully and reflect in the indices used in further analyses are the relative document frequencies (df) of different words. For example, a term such as “guess” may occur frequently in all documents, while another term such as “software” may only occur in a few. The reason is that we might make “guesses” in various contexts, regardless of the specific topic, while “software” is a more semantically focused term that is only likely to occur in documents that deal with computer software. A common and very useful transformation that reflects both the specificity of words (document frequencies) as well as the overall frequencies of their occurrences (word frequencies) is the so-called inverse document frequency (for the i’th word and j’th document):

In this formula (see also formula 15.5 in Manning and Schütze, 2002), *N* is the total number of documents, and *dfi* is the document frequency for the *i*‘th word (the number of documents that include this word). Hence, it can be seen that this formula includes both the dampening of the simple word frequencies via the log function (described above), and also includes a weighting factor that evaluates to 0 if the word occurs in all documents *(log(N/N=1)=0)*, and to the maximum value when a word only occurs in a single document *(log(N/1)=log(N))*. It can easily be seen how this transformation will create indices that both reflect the relative frequencies of occurrences of words, as well as their semantic specificities over the documents included in the analysis.

## Latent Semantic Indexing via Singular Value Decomposition

As described above, the most basic result of the initial indexing of words found in the input documents is a frequency table with simple counts, i.e., the number of times that different words occur in each input document. Usually, we would transform those raw counts to indices that better reflect the (relative) “importance” of words and/or their semantic specificity in the context of the set of input documents (see the discussion of inverse document frequencies, above).

A common analytic tool for interpreting the “meaning” or “semantic space” described by the words that were extracted, and hence by the documents that were analyzed, is to create a mapping of the word and documents into a common space, computed from the word frequencies or transformed word frequencies (e.g., inverse document frequencies). In general, here is how it works:

Suppose you indexed a collection of customer reviews of their new automobiles (e.g., for different makes and models). You may find that every time a review includes the word “gas-mileage,” it also includes the term “economy.” Further, when reports include the word “reliability” they also include the term “defects” (e.g., make reference to “no defects”). However, there is no consistent pattern regarding the use of the terms “economy” and “reliability,” i.e., some documents include either one or both. In other words, these four words “gas-mileage” and “economy,” and “reliability” and “defects,” describe two independent dimensions – the first having to do with the overall operating cost of the vehicle, the other with the quality and workmanship. The idea of latent semantic indexing is to identify such underlying dimensions (of “meaning”), into which the words and documents can be mapped. As a result, we may identify the underlying (latent) themes described or discussed in the input documents, and also identify the documents that mostly deal with economy, reliability, or both. Hence, we want to map the extracted words or terms and input documents into a common latent semantic space.

**Singular value decomposition.** The use of singular value decomposition in order to extract a common space for the variables and cases (observations) is used in various statistical techniques, most notably in *Correspondence Analysis*. The technique is also closely related to Principal Components Analysis and *Factor Analysis*. In general, the purpose of this technique is to reduce the overall dimensionality of the input matrix (number of input documents by number of extracted words) to a lower-dimensional space, where each consecutive dimension represents the largest degree of variability (between words and documents) possible. Ideally, you might identify the two or three most salient dimensions, accounting for most of the variability (differences) between the words and documents and, hence, identify the latent semantic space that organizes the words and documents in the analysis. In some way, once such dimensions can be identified, you have extracted the underlying “meaning” of what is contained (discussed, described) in the documents.

## Incorporating Text Mining Results in Data Mining Projects

After significant (e.g., frequent) words have been extracted from a set of input documents, and/or after singular value decomposition has been applied to extract salient semantic dimensions, typically the next and most important step is to use the extracted information in a data mining project.

**Graphics (visual data mining methods).** Depending on the purpose of the analyses, in some instances the extraction of semantic dimensions alone can be a useful outcome if it clarifies the underlying structure of what is contained in the input documents. For example, a study of new car owners’ comments about their vehicles may uncover the salient dimensions in the minds of those drivers when they think about or consider their automobile (or how they “feel” about it). For marketing research purposes, that in itself can be a useful and significant result. You can use the graphics (e.g., 2D scatterplots or 3D scatterplots) to help you visualize and identify the semantic space extracted from the input documents.

**Clustering and factoring.** You can use cluster analysis methods to identify groups of documents (e.g., vehicle owners who described their new cars), to identify groups of similar input texts. This type of analysis also could be extremely useful in the context of market research studies, for example of new car owners. You can also use *Factor Analysis* and Principal Components and Classification Analysis (to factor analyze words or documents).

**Predictive data mining.** Another possibility is to use the raw or transformed word counts as predictor variables in predictive data mining projects.

## Survival/Failure Time Analysis

## General Information

These techniques were primarily developed in the medical and biological sciences, but they are also widely used in the social and economic sciences, as well as in engineering (reliability and failure time analysis).

Imagine that you are a researcher in a hospital who is studying the effectiveness of a new treatment for a generally terminal disease. The major variable of interest is the number of days that the respective patients survive. In principle, one could use the standard parametric and nonparametric statistics for describing the average survival, and for comparing the new treatment with traditional methods (see *Basic Statistics* and *Nonparametrics and Distribution Fitting*). However, at the end of the study there will be patients who survived over the entire study period, in particular among those patients who entered the hospital (and the research project) late in the study; there will be other patients with whom we will have lost contact. Surely, one would not want to exclude all of those patients from the study by declaring them to be missing data (since most of them are “survivors” and, therefore, they reflect on the success of the new treatment method). Those observations, which contain only partial information are called censored observations (e.g., “patient A survived at least 4 months before he moved away and we lost contact;” the term censoring was first used by Hald, 1949).

## Censored Observations

In general, censored observations arise whenever the dependent variable of interest represents the time to a terminal event, and the duration of the study is limited in time. Censored observations may occur in a number of different areas of research. For example, in the social sciences we may study the “survival” of marriages, high school drop-out rates (time to drop-out), turnover in organizations, etc. In each case, by the end of the study period, some subjects will still be married, will not have dropped out, or are still working at the same company; thus, those subjects represent censored observations.

In economics we may study the “survival” of new businesses or the “survival” times of products such as automobiles. In quality control research, it is common practice to study the “survival” of parts under stress (failure time analysis).

## Analytic Techniques

Essentially, the methods offered in *Survival Analysis* address the same research questions as many of the other procedures; however, all methods in *Survival Analysis* will handle censored data. The *life table, survival distribution*, and *Kaplan-Meier* survival function estimation are all descriptive methods for estimating the distribution of survival times from a sample. Several techniques are available for comparing the survival in two or more groups. Finally, *Survival Analysis* offers several regression models for estimating the relationship of (multiple) continuous variables to survival times.

## Life Table Analysis

The most straightforward way to describe the survival in a sample is to compute the *Life Table*. The life table technique is one of the oldest methods for analyzing survival (failure time) data (e.g., see Berkson & Gage, 1950; Cutler & Ederer, 1958; Gehan, 1969). This table can be thought of as an “enhanced” frequency distribution table. The distribution of survival times is divided into a certain number of intervals. For each interval we can then compute the number and proportion of cases or objects that entered the respective interval “alive,” the number and proportion of cases that failed in the respective interval (i.e., number of terminal events, or number of cases that “died”), and the number of cases that were lost or censored in the respective interval.

Based on those numbers and proportions, several additional statistics can be computed:

- Number of Cases at Risk
- Proportion Failing
- Proportion surviving
- Cumulative Proportion Surviving (Survival Function)
- Probability Density
- Hazard rate
- Median survival time
- Required sample sizes

**Number of Cases at Risk. **This is the number of cases that entered the respective interval alive, minus half of the number of cases lost or censored in the respective interval.

**Proportion Failing. **This proportion is computed as the ratio of the number of cases failing in the respective interval, divided by the number of cases at risk in the interval.

**Proportion Surviving. **This proportion is computed as 1 minus the proportion failing.

**Cumulative Proportion Surviving (Survival Function). **This is the cumulative proportion of cases surviving up to the respective interval. Since the probabilities of survival are assumed to be independent across the intervals, this probability is computed by multiplying out the probabilities of survival across all previous intervals. The resulting function is also called the *survivorship* or *survival function*.

**Probability Density. **This is the estimated probability of failure in the respective interval, computed per unit of time, that is:

F_{i} = (P_{i}-P_{i+1}) /h_{i}

In this formula, F_{i} is the respective probability density in the i’th interval, P_{i} is the estimated cumulative proportion surviving at the beginning of the i’th interval (at the end of interval i-1), P_{i+1} is the cumulative proportion surviving at the end of the i’th interval, and h_{i} is the width of the respective interval.

**Hazard Rate. **The hazard rate (the term was first used by Barlow, 1963) is defined as the probability per time unit that a case that has survived to the beginning of the respective interval will fail in that interval. Specifically, it is computed as the number of failures per time units in the respective interval, divided by the average number of surviving cases at the mid-point of the interval.

**Median Survival Time. **This is the survival time at which the cumulative survival function is equal to *0.5*. Other percentiles (25th and 75th percentile) of the cumulative survival function can be computed accordingly. Note that the 50th percentile (median) for the cumulative survival function is usually not the same as the point in time up to which 50% of the sample survived. (This would only be the case if there were no censored observations prior to this time).

**Required Sample Sizes. **In order to arrive at reliable estimates of the three major functions (survival, probability density, and hazard) and their standard errors at each time interval the minimum recommended sample size is 30.

## Distribution Fitting

**General Introduction. **In summary, the life table gives us a good indication of the distribution of failures over time. However, for predictive purposes it is often desirable to understand the shape of the underlying survival function in the population. The major distributions that have been proposed for modeling survival or failure times are the exponential (and linear exponential) distribution, the Weibull distribution of extreme events, and the Gompertz distribution.

**Estimation. **The parameter estimation procedure (for estimating the parameters of the theoretical survival functions) is essentially a least squares linear regression algorithm (see Gehan & Siddiqui, 1973). A linear regression algorithm can be used because all four theoretical distributions can be “made linear” by appropriate transformations. Such transformations sometimes produce different variances for the residuals at different times, leading to biased estimates.

**Goodness-of-Fit. **Given the parameters for the different distribution functions and the respective model, we can compute the likelihood of the data. One can also compute the likelihood of the data under the null model, that is, a model that allows for different hazard rates in each interval. Without going into details, these two likelihoods can be compared via an incremental *Chi-square* test statistic. If this *Chi-square* is statistically significant, then we conclude that the respective theoretical distribution fits the data significantly worse than the null model; that is, we reject the respective distribution as a model for our data.

**Plots. **You can produce plots of the survival function, hazard, and probability density for the observed data and the respective theoretical distributions. These plots provide a quick visual check of the goodness-of-fit of the theoretical distribution. The example plot below shows an observed survivorship function and the fitted Weibull distribution.

Specifically, the three lines in this plot denote the theoretical distributions that resulted from three different estimation procedures (least squares and two methods of weighted least squares).

## Kaplan-Meier Product-Limit Estimator

Rather than classifying the observed survival times into a life table, we can estimate the survival function directly from the continuous survival or failure times. Intuitively, imagine that we create a life table so that each time interval contains exactly one case. Multiplying out the survival probabilities across the “intervals” (i.e., for each single observation) we would get for the survival function:

S(t) = _{j}^{t}_{= 1} [(n-j)/(n-j+1)]^{( j )}

In this equation, *S(t)* is the estimated survival function, *n* is the total number of cases, and denotes the multiplication (geometric sum) across all cases less than or equal to *t*; *(j)* is a constant that is either *1* if the *j*‘th case is uncensored (complete), and *0* if it is censored. This estimate of the survival function is also called the *product-limit estimator*, and was first proposed by Kaplan and Meier (1958). An example plot of this function is shown below.

The advantage of the Kaplan-Meier Product-Limit method over the life table method for analyzing survival and failure time data is that the resulting estimates do not depend on the grouping of the data (into a certain number of time intervals). Actually, the Product-Limit method and the life table method are identical if the intervals of the life table contain at most one observation.

## Comparing Samples

- General Introduction
- Available tests
- Choosing a two-sample test
- Multiple sample test
- Unequal proportions of censored data

**General Introduction. **One can compare the survival or failure times in two or more samples. In principle, because survival times are not normally distributed, nonparametric tests that are based on the *rank ordering* of survival times should be applied. A wide range of nonparametric tests can be used in order to compare survival times; however, the tests cannot “handle” censored observations.

**Available Tests.** The following five different (mostly nonparametric) tests for censored data are available: Gehan’s generalized Wilcoxon test, the Cox-Mantel test, the Cox’s *F* test , the log-rank test, and Peto and Peto’s generalized Wilcoxon test. A nonparametric test for the comparison of multiple groups is also available. Most of these tests are accompanied by appropriate *z*– values (values of the standard normal distribution); these *z*-values can be used to test for the statistical significance of any differences between groups. However, note that most of these tests will only yield reliable results with fairly large samples sizes; the small sample “behavior” is less well understood.

**Choosing a Two Sample Test. **There are no widely accepted guidelines concerning which test to use in a particular situation. Cox’s *F* test tends to be more powerful than Gehan’s generalized Wilcoxon test when:

- Sample sizes are small (i.e.,
*n*per group less than 50); - If samples are from an exponential or Weibull;
- If there are no censored observations (see Gehan & Thomas, 1969).

Lee, Desu, and Gehan (1975) compared Gehan’s test to several alternatives and showed that the Cox-Mantel test and the log-rank test are more powerful (regardless of censoring) when the samples are drawn from a population that follows an exponential or Weibull distribution; under those conditions there is little difference between the Cox-Mantel test and the log-rank test. Lee (1980) discusses the power of different tests in greater detail.

**Multiple Sample Test. **There is a multiple-sample test that is an extension (or generalization) of Gehan’s generalized Wilcoxon test, Peto and Peto’s generalized Wilcoxon test, and the log-rank test. First, a score is assigned to each survival time using Mantel’s procedure (Mantel, 1967); next a *Chi- square* value is computed based on the sums (for each group) of this score. If only two groups are specified, then this test is equivalent to Gehan’s generalized Wilcoxon test, and the computations will default to that test in this case.

**Unequal Proportions of Censored Data. **When comparing two or more groups it is very important to examine the number of censored observations in each group. Particularly in medical research, censoring can be the result of, for example, the application of different treatments: patients who get better faster or get worse as the result of a treatment may be more likely to drop out of the study, resulting in different numbers of censored observations in each group. Such systematic censoring may greatly bias the results of comparisons.

## Regression Models

- General Introduction
- Cox’s Proportional Hazard Model
- Cox’s Proportional Hazard Model with Time-Dependent Covariates
- Exponential Regression
- Normal and Log-Normal Regression
- Stratified Analyses

### General Introduction

A common research question in medical, biological, or engineering (failure time) research is to determine whether or not certain continuous (independent) variables are correlated with the survival or failure times. There are two major reasons why this research issue cannot be addressed via straightforward multiple regression techniques (as available in Multiple Regression): First, the dependent variable of interest (survival/failure time) is most likely not normally distributed — a serious violation of an assumption for ordinary least squares multiple regression. Survival times usually follow an exponential or Weibull distribution. Second, there is the problem of censoring, that is, some observations will be incomplete.

### Cox’s Proportional Hazard Model

The proportional hazard model is the most general of the regression models because it is not based on any assumptions concerning the nature or shape of the underlying survival distribution. The model assumes that the underlying hazard *rate* (rather than survival time) is a function of the independent variables (covariates); no assumptions are made about the nature or shape of the hazard function. Thus, in a sense, Cox’s regression model may be considered to be a nonparametric method. The model may be written as:

h{(t), (z_{1}, z_{2}, …, z_{m})} = h_{0}(t)*exp(b_{1}*z_{1} + … + b_{m}*z_{m})

where *h(t,…)* denotes the resultant hazard, given the values of the *m *covariates for the respective case (*z _{1}*,

*z*, …,

_{2}*z*) and the respective survival time (

_{m}*t*). The term

*h*is called the

_{0}(t)*baseline hazard*; it is the hazard for the respective individual when all independent variable values are equal to zero. We can linearize this model by dividing both sides of the equation by

*h*and then taking the natural logarithm of both sides:

_{0}(t)log[h{(t), (z…)}/h_{0}(t)] = b_{1}*z_{1} + … + b_{m}*z_{m}

We now have a fairly “simple” linear model that can be readily estimated.

**Assumptions.** While no assumptions are made about the shape of the underlying hazard function, the model equations shown above do imply two assumptions. First, they specify a multiplicative relationship between the underlying hazard function and the log-linear function of the covariates. This assumption is also called the *proportionality assumption*. In practical terms, it is assumed that, given two observations with different values for the independent variables, the ratio of the hazard functions for those two observations does not depend on time. The second assumption of course, is that there is a log-linear relationship between the independent variables and the underlying hazard function.

### Cox’s Proportional Hazard Model with Time-Dependent Covariates

An assumption of the proportional hazard model is that the hazard function for an individual (i.e., observation in the analysis) depends on the values of the covariates and the value of the baseline hazard. Given two individuals with particular values for the covariates, the ratio of the estimated hazards over time will be constant — hence the name of the method: the *proportional hazard *model. The validity of this assumption may often be questionable. For example, age is often included in studies of physical health. Suppose you studied survival after surgery. It is likely, that age is a more important predictor of risk immediately after surgery, than some time after the surgery (after initial recovery). In accelerated life testing one sometimes uses a stress covariate (e.g., amount of voltage) that is slowly increased over time until failure occurs (e.g., until the electrical insulation fails; see Lawless, 1982, page 393). In this case, the impact of the covariate is clearly dependent on time. The user can specify arithmetic expressions to define covariates as functions of several variables and survival time.

**Testing the Proportionality Assumption.** As indicated by the previous examples, there are many applications where it is likely that the proportionality assumption does not hold. In that case, one can explicitly define covariates as functions of time. For example, the analysis of a data set presented by Pike (1966) consists of survival times for two groups of rats that had been exposed to a carcinogen (see also Lawless, 1982, page 393, for a similar example). Suppose that *z *is a grouping variable with codes *1 *and *0 *to denote whether or not the respective rat was exposed. One could then fit the proportional hazard model:

h(t,z) = h_{0}(t)*exp{b_{1}*z + b_{2}*[z*log(t)-5.4]}

Thus, in this model the conditional hazard at time *t *is a function of (1) the baseline hazard *h _{0}*, (2) the covariate

*z*, and (3) of

*z*times the logarithm of time. Note that the constant

*5.4*is used here for scaling purposes only: the mean of the logarithm of the survival times in this data set is equal to

*5.4*. In other words, the conditional hazard at each point in time is a function of the covariate and time; thus, the effect of the covariate on survival is dependent on time; hence the name

*time-dependent covariate*. This model allows one to specifically test the proportionality assumption. If parameter

*b*is statistically significant (e.g., if it is at least twice as large as its standard error), then one can conclude that, indeed, the effect of the covariate

_{2}*z*on survival is dependent on time, and, therefore, that the proportionality assumption does not hold.

### Exponential Regression

Basically, this model assumes that the survival time distribution is exponential, and contingent on the values of a set of independent variables (*z**i*). The rate parameter of the exponential distribution can then be expressed as:

S(z) = exp(a + b_{1}*z_{1} + b_{2}*z_{2} + … + b_{m}*z_{m})

*S(z)* denotes the survival times, *a* is a constant, and the *b _{i}*‘s are the regression parameters.

Goodness-of-fit. The *Chi-square* goodness-of-fit value is computed as a function of the log-likelihood for the model with all parameter estimates (*L**1*), and the log-likelihood of the model in which all covariates are forced to 0 (zero; *L**0*). If this *Chi-square* value is significant, we reject the null hypothesis and assume that the independent variables are significantly related to survival times.

Standard exponential order statistic. One way to check the exponentiality assumption of this model is to plot the residual survival times against the standard exponential order statistic *theta*. If the exponentiality assumption is met, then all points in this plot will be arranged roughly in a straight line.

### Normal and Log-Normal Regression

In this model, it is assumed that the survival times (or log survival times) come from a normal distribution; the resulting model is basically identical to the ordinary multiple regression model, and may be stated as:

t = a + b_{1}*z_{1} + b_{2}*z_{2} + … + b_{m}*z_{m}

where *t* denotes the survival times. For log-normal regression, *t* is replaced by its natural logarithm. The normal regression model is particularly useful because many data sets can be transformed to yield approximations of the normal distribution. Thus, in a sense this is the most general fully parametric model (as opposed to Cox’s proportional hazard model which is non-parametric), and estimates can be obtained for a variety of different underlying survival distributions.

Goodness-of-fit. The *Chi-square* value is computed as a function of the log-likelihood for the model with all independent variables (L1), and the log-likelihood of the model in which all independent variables are forced to 0 (zero, L0).

### Stratified Analyses

The purpose of a stratified analysis is to test the hypothesis whether identical regression models are appropriate for different groups, that is, whether the relationships between the independent variables and survival are identical in different groups. To perform a stratified analysis, one must first fit the respective regression model separately within each group. The sum of the log-likelihoods from these analyses represents the log-likelihood of the model with different regression coefficients (and intercepts where appropriate) in different groups. The next step is to fit the requested regression model to all data in the usual manner (i.e., ignoring group membership), and compute the log-likelihood for the overall fit. The difference between the log-likelihoods can then be tested for statistical significance (via the *Chi-square* statistic).

## Structural Equation Modeling

## A Conceptual Overview

*Structural Equation Modeling* is a very general, very powerful multivariate analysis technique that includes specialized versions of a number of other analysis methods as special cases. We will assume that you are familiar with the basic logic of statistical reasoning as described in *Elementary Concepts*. Moreover, we will also assume that you are familiar with the concepts of variance, covariance, and correlation; if not, we advise that you read the *Basic Statistics* section at this point. Although it is not absolutely necessary, it is highly desirable that you have some background in *factor analysis* before attempting to use structural modeling.

Major applications of structural equation modeling include:

*causal modeling*, or*path analysis*, which hypothesizes causal relationships among variables and tests the causal models with a linear equation system. Causal models can involve either manifest variables, latent variables, or both;*confirmatory factor analysis*, an extension of factor analysis in which specific hypotheses about the structure of the factor loadings and intercorrelations are tested;*second order factor analysis*, a variation of factor analysis in which the correlation matrix of the common factors is itself factor analyzed to provide second order factors;*regression models*, an extension of linear regression analysis in which regression weights may be constrained to be equal to each other, or to specified numerical values;*covariance structure models*, which hypothesize that a covariance matrix has a particular form. For example, you can test the hypothesis that a set of variables all have equal variances with this procedure;*correlation structure models*, which hypothesize that a correlation matrix has a particular form. A classic example is the hypothesis that the correlation matrix has the structure of a circumplex (Guttman, 1954; Wiggins, Steiger, & Gaelick, 1981).

Many different kinds of models fall into each of the above categories, so structural modeling as an enterprise is very difficult to characterize.

Most structural equation models can be expressed as path diagrams. Consequently even beginners to structural modeling can perform complicated analyses with a minimum of training.

## The Basic Idea Behind Structural Modeling

One of the fundamental ideas taught in intermediate applied statistics courses is the effect of additive and multiplicative transformations on a list of numbers. Students are taught that, if you multiply every number in a list by some constant K, you multiply the *mean* of the numbers by K. Similarly, you multiply the *standard deviation* by the absolute value of K.

For example, suppose you have the list of numbers 1,2,3. These numbers have a *mean* of 2 and a *standard deviation* of 1. Now, suppose you were to take these 3 numbers and multiply them by 4. Then the mean would become 8, and the standard deviation would become 4, the variance thus 16.

The point is, if you have a set of numbers X related to another set of numbers Y by the equation Y = 4X, then the variance of Y *must* be 16 times that of X, so you can test the hypothesis that Y and X are related by the equation Y = 4X *indirectly* by comparing the variances of the Y and X variables.

This idea generalizes, in various ways, to several variables inter-related by a group of linear equations. The rules become more complex, the calculations more difficult, but the basic message remains the same — *you can test whether variables are interrelated through a set of linear relationships by examining the variances and covariances of the variables.*

Statisticians have developed procedures for testing whether a set of variances and covariances in a covariance matrix fits a specified structure. The way structural modeling works is as follows:

- You state the way that you believe the variables are inter-related, often with the use of a
*path diagram*. - You work out, via some complex internal rules, what the implications of this are for the variances and covariances of the variables.
- You test whether the variances and covariances fit this model of them.
- Results of the statistical testing, and also parameter estimates and standard errors for the numerical coefficients in the linear equations are reported.
- On the basis of this information, you decide whether the model seems like a good fit to your data.

There are some important, and very basic logical points to remember about this process. First, although the mathematical machinery required to perform structural equations modeling is extremely complicated, the basic logic is embodied in the above 5 steps. Below, we diagram the process.

Second, we must remember that it is unreasonable to expect a structural model to fit perfectly — for a number of reasons. A structural model with linear relations is only an approximation. The world is unlikely to be linear. Indeed, the true relations between variables are probably nonlinear. Moreover, many of the statistical assumptions are somewhat questionable as well. The real question is not so much, “Does the model fit perfectly?” but rather, “Does it fit well enough to be a useful approximation to reality, and a reasonable explanation of the trends in our data?”

Third, we must remember that simply because a model fits the data well does not mean that the model is necessarily correct. One cannot prove that a model is true — to assert this is the fallacy of affirming the consequent. For example, we could say “If Joe is a cat, Joe has hair.” However, “Joe has hair” does not imply Joe is a cat. Similarly, we can say that “If a certain causal model is true, it will fit the data.” However, the model fitting the data does not necessarily imply the model is the correct one. There may be another model that fits the data equally well.

## Structural Equation Modeling and the Path Diagram

*Path Diagrams *play a fundamental role in structural modeling. Path diagrams are like flowcharts. They show variables interconnected with lines that are used to indicate causal flow.

One can think of a path diagram as a device for showing which variables cause changes in other variables. However, path diagrams *need not* be thought of strictly in this way. They may also be given a narrower, more specific interpretation.

Consider the classic linear regression equation

Y = aX + e

Any such equation may be represented in a path diagram as follows:

Such diagrams establish a simple isomorphism. All variables in the equation system are placed in the diagram, either in boxes or ovals. Each equation is represented on the diagram as follows: All independent variables (the variables on the right side of an equation) have *arrows* pointing to the dependent variable. The weighting coefficient is placed above the arrow. The above diagram shows a simple linear equation system and its path diagram representation.

Notice that, besides representing the linear equation relationships with arrows, the diagrams also contain some additional aspects. First, the variances of the independent variables, which we must know in order to test the structural relations model, are shown on the diagrams using curved lines without arrowheads attached. We refer to such lines as wires. Second, some variables are represented in ovals, others in rectangular boxes. Manifest variables are placed in boxes in the path diagram. Latent variables are placed in an oval or circle. For example, the variable *E *in the above diagram can be thought of as a linear regression residual when Yis predicted from X. Such a residual is not observed directly, but calculated from Y and X, so we treat it as a *latent* variable and place it in an oval.

The example discussed above is an extremely simple one. Generally, we are interested in testing models that are much more complicated than these. As the equation systems we examine become increasingly complicated, so do the *covariance structures* they imply. Ultimately, the complexity can become so bewildering that we lose sight of some very basic principles. For one thing the train of reasoning which supports testing causal models with linear structural equations testing has several weak links. The variables may be non-linear. They may be linearly related for reasons unrelated to what we commonly view as causality. The ancient adage, “correlation is not causation” remains true, even if the correlation is complex and multivariate. What causal modeling *does *allow us to do is examine the extent to which data fail to agree with one reasonably viable consequence of a model of causality. If the linear equations system isomorphic to the path diagram does fit the data well, it is encouraging, but hardly proof of the truth of the causal model.

Although path diagrams can be used to represent causal flow in a system of variables, they need not imply such a causal flow. Such diagrams may be viewed as simply an isomorphic representation of a linear equations system. As such, they can convey linear relationships when no causal relations are assumed. Hence, although one *might* interpret the diagram in the above figure to mean that “X causes Y,” the diagram can also be interpreted as a visual representation of the linear regression relationship between X and Y.