Blog Archives

How to Plot Graphs on Multiple Scales

Graphing is a vital part of any data analysis project. Graphs visually reveal patterns and relationships between variables and provide invaluable information. Sometimes, however, the scaling of the data interferes with the message the plot is meant to convey.
When units and scales vary greatly, showing useful detail for every variable on a single plot becomes impossible. That is when a multi-variable plot needs multiple, varying scales. Let’s look at our options…
Double Y Plots
Many graphing tools have a Graph type option called Double-Y. This graph type makes it possible for you to select one or more variables associated with the left Y axis and one or more variables to associate with the right Y axis. This is a simple way of creating a compound graph that shows variables with two different scales.
For example, open the STATISTICA data file, Baseball.sta, from the path C:/Program Files/StatSoft/STATISTICA 12/Examples/Datasets. Several of the variables in this example data file have very different scales.
On the Graphs tab in the Common group, click Scatterplot. In the 2D Scatterplots Start Panel, select the Advanced tab. In the Graph type group box, select Double-Y.
Now, click the Variables button, and in the variable selection dialog box, select RUNS as X, WIN as Y Left, and DP as Y Right. Click the OK button.
Click OK in the 2D Scatterplots Startup Panel to create the plot. The result lists the two Y variables with separately determined scales.
WIN shows a scale from 0.25 to 0.65. This is the season winning proportion. The variable DP is shown on a scale from 100 to 220 and is the number of double plays in the season. Because of the great difference in the scale of these two variables, a Double-Y plot is the best way to simultaneously show these variables’ relationships with the X factor, RUNS.
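The Double-Y idea is not specific to STATISTICA; most plotting libraries support it. As a rough sketch, here is how the same plot could be drawn in Python with matplotlib (assuming matplotlib is installed). The variable names follow the Baseball.sta example, but the values below are invented for illustration, not the actual data:

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; no display needed
import matplotlib.pyplot as plt

# Illustrative values only -- not the actual Baseball.sta data
runs = [600, 650, 700, 750, 800]
win = [0.30, 0.42, 0.48, 0.55, 0.62]   # season winning proportion (left Y)
dp = [120, 150, 135, 180, 205]         # double plays (right Y)

fig, ax_left = plt.subplots()
ax_right = ax_left.twinx()  # second Y axis sharing the same X axis

ax_left.scatter(runs, win, color="tab:blue", label="WIN")
ax_right.scatter(runs, dp, color="tab:red", label="DP")

ax_left.set_xlabel("RUNS")
ax_left.set_ylabel("WIN (proportion)")
ax_right.set_ylabel("DP (double plays)")
fig.savefig("double_y.png")
```

Each Y axis is scaled independently, so detail in both WIN and DP survives on one plot.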
Multiple Y Plots
An additional option is available for creating plots with multiple axis scales. This option is used when you need more scales than the Double-Y allows or when you need an additional axis in another place or capacity.
Continuing the same example, add a second variable, BA, to the Y Left variable list.
Click OK to create the new plot.
Now, WIN and BA share the left Y axis. BA, batting average, is on a scale of .2 to .3. Giving BA a separate Y axis scale would show more detail in the added variable. To do this, right-click in the graph, and on the shortcut menu select Graph Options. Select the Axis – General tab of the Graph Options dialog box.
From the Axis drop-down menu, select the Y left axis. Then click Add new axis. A new Y left axis is added to the plot called Y left’.
Next, the BA variable needs to be related to that axis and customized. Select the Plot – General tab to make this change.
On the Plot drop-down list, select the variable BA. Then, in the Assignment of axis group, select the Custom option button, and specify Y left’ as the custom axis.
Click OK to update the plot.

The resulting plot now has three Y variables plotted, each with its own Y axis scaling and labeling. Showing patterns and relationships in data of varying scale is made easy with multiple axes.

Success Story – Nelson Mandela Metropolitan University

Ms Jennifer Bowler, Lecturer in Industrial Psychology and Human Resources at Nelson Mandela Metropolitan University, describes the Statistica training programme delivered by StatSoft Southern Africa’s trainer Merle Weberlof and how it has benefited her.

“The two days were personally extremely beneficial. I had expected a “how to work with Statistica” course, and what I got was how to understand the relationship between research design and analytical tools, and then how to put that into practice in Statistica. I felt very sad last night that I had not been exposed to someone like Merle much earlier in my research career – it would have saved me many, many hours of confusion and frustration. I am, however, pleased to say that my nervousness regarding Statistica has been put to rest and I have clarified many issues regarding analysis and design.
I know that Merle was worried that she lost people at certain times, and I would not presume to comment for the others, but each member of the group was at a different stage of personal development as far as research is concerned, and we had different disciplines represented. My sense was that each person took away something of value and application, even if they could not understand and/or utilise all that was offered.
Thanks very much for accommodating us for the two days.

If there is anything else that you want feedback on – please let me know.
I hope that you had a good trip to CT ”

Kind regards
Jennifer Bowler

Text Mining Introductory Overview

The purpose of Text Mining is to process unstructured (textual) information, extract meaningful numeric indices from the text, and, thus, make the information contained in the text accessible to the various data mining (statistical and machine learning) algorithms. Information can be extracted to derive summaries for the words contained in the documents or to compute summaries for the documents based on the words contained in them. Hence, you can analyze words, clusters of words used in documents, etc., or you could analyze documents and determine similarities between them or how they are related to other variables of interest in the data mining project. In the most general terms, text mining will “turn text into numbers” (meaningful indices), which can then be incorporated in other analyses such as predictive data mining projects, the application of unsupervised learning methods (clustering), etc. These methods are described and discussed in great detail in the comprehensive overview work by Manning and Schütze (2002), and for an in-depth treatment of these and related topics as well as the history of this approach to text mining, we highly recommend that source.

Typical Applications for Text Mining

Unstructured text is very common, and in fact may represent the majority of information available to a particular research or data mining project.

Analyzing open-ended survey responses. In survey research (e.g., marketing), it is not uncommon to include various open-ended questions pertaining to the topic under investigation. The idea is to permit respondents to express their “views” or opinions without constraining them to particular dimensions or a particular response format. This may yield insights into customers’ views and opinions that might otherwise not be discovered when relying solely on structured questionnaires designed by “experts.” For example, you may discover a certain set of words or terms that are commonly used by respondents to describe the pros and cons of a product or service under investigation, suggesting common misconceptions or confusion regarding the items in the study.

Automatic processing of messages, emails, etc. Another common application for text mining is to aid in the automatic classification of texts. For example, most undesirable “junk email” can be “filtered” out automatically based on certain terms or words that are not likely to appear in legitimate messages but instead identify undesirable electronic mail. In this manner, such messages can automatically be discarded. Such automatic systems for classifying electronic messages can also be useful in applications where messages need to be routed (automatically) to the most appropriate department or agency; e.g., email messages with complaints or petitions to a municipal authority are automatically routed to the appropriate departments, while at the same time the emails are screened for inappropriate or obscene content, which is automatically returned to the sender with a request to remove the offending words or content.

Analyzing warranty or insurance claims, diagnostic interviews, etc. In some business domains, the majority of information is collected in open-ended, textual form. For example, warranty claims or initial medical (patient) interviews can be summarized in brief narratives, or when you take your automobile to a service station for repairs, typically, the attendant will write some notes about the problems that you report and what you believe needs to be fixed. Increasingly, those notes are collected electronically, so those types of narratives are readily available for input into text mining algorithms. This information can then be usefully exploited to, for example, identify common clusters of problems and complaints on certain automobiles, etc. Likewise, in the medical field, open-ended descriptions by patients of their own symptoms might yield useful clues for the actual medical diagnosis.

Investigating competitors by crawling their web sites. Another type of potentially very useful application is to automatically process the contents of Web pages in a particular domain. For example, you could go to a Web page, and begin “crawling” the links you find there to process all Web pages that are referenced. In this manner, you could automatically derive a list of terms and documents available at that site, and hence quickly determine the most important terms and features that are described. It is easy to see how these capabilities could efficiently deliver valuable business intelligence about the activities of competitors.

Approaches to Text Mining

To reiterate, text mining can be summarized as a process of “numericizing” text. At the simplest level, all words found in the input documents will be indexed and counted in order to compute a table of documents and words, i.e., a matrix of frequencies that enumerates the number of times that each word occurs in each document. This basic process can be further refined to exclude certain common words such as “the” and “a” (stop word lists) and to combine different grammatical forms of the same words such as “traveling,” “traveled,” “travel,” etc. (stemming). However, once a table of (unique) words (terms) by documents has been derived, all standard statistical and data mining techniques can be applied to derive dimensions or clusters of words or documents, or to identify “important” words or terms that best predict another outcome variable of interest.
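As a rough illustration of this numericizing step, here is a minimal pure-Python sketch that builds a documents-by-words frequency table with a toy stop-word list and a deliberately crude suffix-stripping stemmer (a real system would use a full stemming algorithm such as Porter’s):

```python
from collections import Counter
import re

STOP_WORDS = {"the", "a", "of", "to", "and"}

def crude_stem(word):
    # Toy stemmer: strip a few common suffixes; a real system would
    # use a full algorithm such as Porter stemming
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def term_document_matrix(documents):
    """Return (terms, matrix) where matrix[d][t] counts term t in doc d."""
    counts = []
    for doc in documents:
        words = re.findall(r"[a-z]+", doc.lower())
        words = [crude_stem(w) for w in words if w not in STOP_WORDS]
        counts.append(Counter(words))
    terms = sorted(set().union(*counts))
    matrix = [[c[t] for t in terms] for c in counts]
    return terms, matrix

docs = ["We traveled to the city.", "Traveling is fun; he travels a lot."]
terms, matrix = term_document_matrix(docs)
# "traveled", "traveling", and "travels" all collapse to the term "travel"
```

Once the matrix exists, any standard statistical or data mining technique can be applied to its rows (documents) or columns (terms).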

Using well-tested methods and understanding the results of text mining. Once a data matrix has been computed from the input documents and words found in those documents, various well-known analytic techniques can be used for further processing those data including methods for clustering, factoring, or predictive data mining (see, for example, Manning and Schütze, 2002).

“Black-box” approaches to text mining and extraction of concepts. There are text mining applications which offer “black-box” methods to extract “deep meaning” from documents with little human effort (to first read and understand those documents). These text mining applications rely on proprietary algorithms for presumably extracting “concepts” from text, and may even claim to be able to summarize large numbers of text documents automatically, retaining the core and most important meaning of those documents. While there are numerous algorithmic approaches to extracting “meaning from documents,” this type of technology is very much still in its infancy, and the aspiration to provide meaningful automated summaries of large numbers of documents may forever remain elusive. We urge skepticism when using such algorithms because 1) if it is not clear to the user how those algorithms work, it cannot possibly be clear how to interpret the results of those algorithms, and 2) the methods used in those programs are not open to scrutiny, for example by the academic community and peer review and, hence, we simply don’t know how well they might perform in different domains. As a final thought on this subject, you may consider this concrete example: Try the various automated translation services available via the Web that can translate entire paragraphs of text from one language into another. Then translate some text, even simple text, from your native language to some other language and back, and review the results. Almost every time, the attempt to translate even short sentences to other languages and back while retaining the original meaning of the sentence produces humorous rather than accurate results. This illustrates the difficulty of automatically interpreting the meaning of text.

Text mining as document search. There is another type of application that is often described and referred to as “text mining” – the automatic search of large numbers of documents based on key words or key phrases. This is the domain of, for example, the popular internet search engines that have been developed over the last decade to provide efficient access to Web pages with certain content. While this is obviously an important type of application with many uses in any organization that needs to search very large document repositories based on varying criteria, it is very different from what has been described here.

Issues and Considerations for “Numericizing” Text

Large numbers of small documents vs. small numbers of large documents. Examples of scenarios using large numbers of small or moderate sized documents were given earlier (e.g., analyzing warranty or insurance claims, diagnostic interviews, etc.). On the other hand, if your intent is to extract “concepts” from only a few documents that are very large (e.g., two lengthy books), then statistical analyses are generally less powerful because the “number of cases” (documents) in this case is very small while the “number of variables” (extracted words) is very large.

Excluding certain characters, short words, numbers, etc. Excluding numbers, certain characters, or sequences of characters, or words that are shorter or longer than a certain number of letters can be done before the indexing of the input documents starts. You may also want to exclude “rare words,” defined as those that only occur in a small percentage of the processed documents.

Include lists, exclude lists (stop-words). Specific lists of words to be indexed can be defined; this is useful when you want to search explicitly for particular words and classify the input documents based on the frequencies with which those words occur. Also, “stop-words,” i.e., terms that are to be excluded from the indexing, can be defined. Typically, a default list of English stop words includes “the,” “a,” “of,” “since,” etc., i.e., words that are used in the respective language very frequently but communicate very little unique information about the contents of the document.

Synonyms and phrases. Synonyms, such as “sick” or “ill”, or words that are used in particular phrases where they denote unique meaning can be combined for indexing. For example, “Microsoft Windows” might be such a phrase, which is a specific reference to the computer operating system, but has nothing to do with the common use of the term “Windows” as it might, for example, be used in descriptions of home improvement projects.

Stemming algorithms. An important pre-processing step before indexing of input documents begins is the stemming of words. The term “stemming” refers to the reduction of words to their roots so that, for example, different grammatical forms or declinations of verbs are identified and indexed (counted) as the same word. For example, stemming will ensure that both “traveling” and “traveled” will be recognized by the text mining program as the same word.

Support for different languages. Stemming, synonyms, the letters that are permitted in words, etc. are highly language dependent operations. Therefore, support for different languages is important.

Transforming Word Frequencies

Once the input documents have been indexed and the initial word frequencies (by document) computed, a number of additional transformations can be performed to summarize and aggregate the information that was extracted.

Log-frequencies. First, various transformations of the frequency counts can be performed. The raw word or term frequencies generally reflect how salient or important a word is in each document. Specifically, words that occur with greater frequency in a document are better descriptors of the contents of that document. However, it is not reasonable to assume that the word counts themselves are proportional to their importance as descriptors of the documents. For example, if a word occurs 1 time in document A, but 3 times in document B, then it is not necessarily reasonable to conclude that this word is 3 times as important a descriptor of document B as compared to document A. Thus, a common transformation of the raw word frequency counts (wf) is to compute:

f(wf) = 1+ log(wf), for wf > 0

This transformation will “dampen” the raw frequencies and how they will affect the results of subsequent computations.

Binary frequencies. Likewise, an even simpler transformation can be used that enumerates whether a term is used in a document; i.e.:

f(wf) = 1, for wf > 0

The resulting documents-by-words matrix will contain only 1s and 0s to indicate the presence or absence of the respective words. Again, this transformation will dampen the effect of the raw frequency counts on subsequent computations and analyses.
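Both transformations are one-liners in code; a minimal Python sketch:

```python
import math

def log_frequency(wf):
    """Dampened frequency: 1 + log(wf) for wf > 0, else 0."""
    return 1 + math.log(wf) if wf > 0 else 0.0

def binary_frequency(wf):
    """Presence/absence indicator: 1 for wf > 0, else 0."""
    return 1 if wf > 0 else 0

# A raw count of 3 is dampened to about 2.1, rather than being
# treated as three times as important as a count of 1
row = [0, 1, 3]
damped = [log_frequency(wf) for wf in row]
binary = [binary_frequency(wf) for wf in row]  # [0, 1, 1]
```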

Inverse document frequencies. Another issue that you may want to consider more carefully, and reflect in the indices used in further analyses, is the relative document frequency (df) of different words. For example, a term such as “guess” may occur frequently in all documents, while another term such as “software” may only occur in a few. The reason is that we might make “guesses” in various contexts, regardless of the specific topic, while “software” is a more semantically focused term that is only likely to occur in documents that deal with computer software. A common and very useful transformation that reflects both the specificity of words (document frequencies) as well as the overall frequencies of their occurrences (word frequencies) is the so-called inverse document frequency (for the i’th word and j’th document):

idf(i,j) = 0,    for wf(i,j) = 0
idf(i,j) = [1 + log(wf(i,j))] * log(N/df(i)),    for wf(i,j) ≥ 1

In this formula (see also formula 15.5 in Manning and Schütze, 2002), N is the total number of documents, and df(i) is the document frequency for the i’th word (the number of documents that include this word). Hence, this formula includes both the dampening of the simple word frequencies via the log function (described above) and a weighting factor that evaluates to 0 if the word occurs in all documents (log(N/N) = log(1) = 0), and to its maximum value when a word occurs in only a single document (log(N/1) = log(N)). It can easily be seen how this transformation creates indices that reflect both the relative frequencies of occurrence of words and their semantic specificities over the documents included in the analysis.
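The inverse-document-frequency weighting translates directly into code. A pure-Python sketch on a toy documents-by-words count matrix (the function and variable names are illustrative, not part of any particular package):

```python
import math

def idf_weight(wf, df, n_docs):
    """Inverse document frequency for one cell of the matrix.

    wf: count of word i in document j
    df: number of documents containing word i
    n_docs: total number of documents N
    """
    if wf == 0:
        return 0.0
    return (1 + math.log(wf)) * math.log(n_docs / df)

# Toy matrix: rows = documents, columns = words
counts = [[2, 1],
          [1, 1],
          [0, 3]]
n_docs = len(counts)
n_words = len(counts[0])
df = [sum(1 for d in range(n_docs) if counts[d][w] > 0)
      for w in range(n_words)]

weighted = [[idf_weight(counts[d][w], df[w], n_docs) for w in range(n_words)]
            for d in range(n_docs)]
# The second word occurs in every document, so its weight is 0 everywhere
```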

Latent Semantic Indexing via Singular Value Decomposition

As described above, the most basic result of the initial indexing of words found in the input documents is a frequency table with simple counts, i.e., the number of times that different words occur in each input document. Usually, we would transform those raw counts to indices that better reflect the (relative) “importance” of words and/or their semantic specificity in the context of the set of input documents (see the discussion of inverse document frequencies, above).

A common analytic tool for interpreting the “meaning” or “semantic space” described by the words that were extracted, and hence by the documents that were analyzed, is to create a mapping of the words and documents into a common space, computed from the word frequencies or transformed word frequencies (e.g., inverse document frequencies). In general, here is how it works:

Suppose you indexed a collection of customer reviews of their new automobiles (e.g., for different makes and models). You may find that every time a review includes the word “gas-mileage,” it also includes the term “economy.” Further, when reports include the word “reliability” they also include the term “defects” (e.g., make reference to “no defects”). However, there is no consistent pattern regarding the use of the terms “economy” and “reliability,” i.e., some documents include either one or both. In other words, these four words “gas-mileage” and “economy,” and “reliability” and “defects,” describe two independent dimensions – the first having to do with the overall operating cost of the vehicle, the other with the quality and workmanship. The idea of latent semantic indexing is to identify such underlying dimensions (of “meaning”), into which the words and documents can be mapped. As a result, we may identify the underlying (latent) themes described or discussed in the input documents, and also identify the documents that mostly deal with economy, reliability, or both. Hence, we want to map the extracted words or terms and input documents into a common latent semantic space.

Singular value decomposition. The use of singular value decomposition in order to extract a common space for the variables and cases (observations) is used in various statistical techniques, most notably in Correspondence Analysis. The technique is also closely related to Principal Components Analysis and Factor Analysis. In general, the purpose of this technique is to reduce the overall dimensionality of the input matrix (number of input documents by number of extracted words) to a lower-dimensional space, where each consecutive dimension represents the largest degree of variability (between words and documents) possible. Ideally, you might identify the two or three most salient dimensions, accounting for most of the variability (differences) between the words and documents and, hence, identify the latent semantic space that organizes the words and documents in the analysis. In some way, once such dimensions can be identified, you have extracted the underlying “meaning” of what is contained (discussed, described) in the documents.
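As an illustrative sketch (assuming NumPy is available), the reduction of a documents-by-words matrix to a small number of latent dimensions via singular value decomposition can be written as follows; the matrix here is a made-up toy example echoing the automobile-review illustration above:

```python
import numpy as np

# Toy documents-by-words matrix (rows = documents, columns = words),
# e.g. counts for "gas-mileage", "economy", "reliability", "defects"
A = np.array([[2.0, 3.0, 0.0, 0.0],
              [1.0, 2.0, 0.0, 1.0],
              [0.0, 0.0, 3.0, 2.0],
              [0.0, 1.0, 2.0, 3.0]])

# Singular value decomposition: A = U @ diag(s) @ Vt
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Keep the k largest singular values -> k latent semantic dimensions
k = 2
doc_coords = U[:, :k] * s[:k]      # documents mapped into the latent space
word_coords = Vt[:k, :].T * s[:k]  # words mapped into the same space

# Rank-k reconstruction approximates the original matrix
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
```

Each consecutive singular value captures the largest remaining share of variability, which is why keeping only the first two or three dimensions often suffices.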

Incorporating Text Mining Results in Data Mining Projects

After significant (e.g., frequent) words have been extracted from a set of input documents, and/or after singular value decomposition has been applied to extract salient semantic dimensions, typically the next and most important step is to use the extracted information in a data mining project.

Graphics (visual data mining methods). Depending on the purpose of the analyses, in some instances the extraction of semantic dimensions alone can be a useful outcome if it clarifies the underlying structure of what is contained in the input documents. For example, a study of new car owners’ comments about their vehicles may uncover the salient dimensions in the minds of those drivers when they think about or consider their automobile (or how they “feel” about it). For marketing research purposes, that in itself can be a useful and significant result. You can use the graphics (e.g., 2D scatterplots or 3D scatterplots) to help you visualize and identify the semantic space extracted from the input documents.

Clustering and factoring. You can use cluster analysis methods to identify groups of similar input documents, for example groups of vehicle owners who described their new cars in similar terms. This type of analysis can be extremely useful in the context of market research studies, for example of new car owners. You can also use Factor Analysis and Principal Components and Classification Analysis to factor-analyze words or documents.

Predictive data mining. Another possibility is to use the raw or transformed word counts as predictor variables in predictive data mining projects.

General Purpose

In some research applications, we can formulate hypotheses about the specific distribution of the variable of interest. For example, variables whose values are determined by an infinite number of independent random events will be distributed following the normal distribution: we can think of a person’s height as being the result of very many independent factors such as numerous specific genetic predispositions, early childhood diseases, nutrition, etc. (see the animation below for an example of the normal distribution). As a result, height tends to be normally distributed in the U.S. population. On the other hand, if the values of a variable are the result of very rare events, then the variable will be distributed according to the Poisson distribution (sometimes called the distribution of rare events). For example, industrial accidents can be thought of as the result of the intersection of a series of unfortunate (and unlikely) events, and their frequency tends to be distributed according to the Poisson distribution. These and other distributions are described in greater detail in the respective glossary topics.

Another common application where distribution fitting procedures are useful is when we want to verify the assumption of normality before using some parametric test (see General Purpose of Nonparametric Tests). For example, you may want to use the Kolmogorov-Smirnov test or the Shapiro-Wilk W test to test for normality.
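The Kolmogorov-Smirnov statistic itself is just the largest vertical distance between the sample’s empirical CDF and the theoretical CDF. A minimal pure-Python sketch against the standard normal (critical values and p-values are left to a statistics package):

```python
import math
import random

def normal_cdf(x, mu=0.0, sigma=1.0):
    """CDF of the normal distribution via the error function."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def ks_statistic(sample, cdf):
    """Largest gap between the empirical CDF and a theoretical CDF."""
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        f = cdf(x)
        d = max(d, (i + 1) / n - f, f - i / n)
    return d

random.seed(0)
normal_sample = [random.gauss(0, 1) for _ in range(1000)]
uniform_sample = [random.uniform(-3, 3) for _ in range(1000)]

d_normal = ks_statistic(normal_sample, normal_cdf)    # small: good fit
d_uniform = ks_statistic(uniform_sample, normal_cdf)  # large: poor fit
```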

Fit of the Observed Distribution

For predictive purposes, it is often desirable to understand the shape of the underlying distribution of the population. To determine this underlying distribution, it is common to fit the observed distribution to a theoretical distribution by comparing the frequencies observed in the data to the expected frequencies of the theoretical distribution (i.e., a Chi-square goodness-of-fit test). In addition to this type of test, some software packages also allow you to compute Maximum Likelihood tests and Method of Matching Moments tests (see Fitting Distributions by Moments in the Process Analysis topic).
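The Chi-square goodness-of-fit statistic is simple to compute from observed and expected bin counts; a minimal sketch (the degrees of freedom and p-value lookup are left to a statistics package):

```python
def chi_square_statistic(observed, expected):
    """Sum of (O - E)^2 / E over all bins."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Observed bin counts vs. counts expected under a hypothesized distribution
observed = [18, 25, 30, 27]
expected = [25.0, 25.0, 25.0, 25.0]

chi2 = chi_square_statistic(observed, expected)
```

Large values of the statistic (relative to the Chi-square distribution with the appropriate degrees of freedom) indicate a poor fit of the hypothesized distribution.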

Which Distribution to use. As described above, certain types of variables follow specific distributions. Variables whose values are determined by an infinite number of independent random events will be distributed following the normal distribution, whereas variables whose values are the result of an extremely rare event would follow the Poisson distribution. The major distributions that have been proposed for modeling survival or failure times are the exponential (and linear exponential) distribution, the Weibull distribution of extreme events, and the Gompertz distribution. The section on types of distributions below describes a number of distributions, giving for each a brief example of the type of data that would most commonly follow it, as well as its probability density function (pdf).

Types of Distributions

Bernoulli Distribution. This distribution best describes all situations where a “trial” is made resulting in either “success” or “failure,” such as when tossing a coin, or when modeling the success or failure of a surgical procedure. The Bernoulli distribution is defined as:

f(x) = p^x * (1-p)^(1-x),    for x = 0, 1

where

 p is the probability that a particular event (e.g., success) will occur.
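A direct transcription of this definition in Python:

```python
def bernoulli_pmf(x, p):
    """P(X = x) for a single trial with success probability p."""
    if x not in (0, 1):
        raise ValueError("x must be 0 or 1")
    return p ** x * (1 - p) ** (1 - x)

# For a surgical procedure that succeeds with probability 0.9:
p_success = bernoulli_pmf(1, 0.9)  # 0.9
p_failure = bernoulli_pmf(0, 0.9)  # 0.1 (within floating point)
```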

Beta Distribution. The beta distribution arises from a transformation of the F distribution and is typically used to model the distribution of order statistics. Because the beta distribution is bounded on both sides, it is often used for representing processes with natural lower and upper limits. For examples, refer to Hahn and Shapiro (1967). The beta distribution is defined as:

f(x) = Γ(ν+ω)/[Γ(ν)Γ(ω)] * x^(ν-1) * (1-x)^(ω-1),    for 0 < x < 1, ν > 0, ω > 0

where

 Γ is the Gamma function
 ν, ω are the shape parameters (Shape1 and Shape2, respectively)

The animation above shows the beta distribution as the two shape parameters change.

Binomial Distribution. The binomial distribution is useful for describing distributions of binomial events, such as the number of males and females in a random sample of companies, or the number of defective components in samples of 20 units taken from a production process. The binomial distribution is defined as:

f(x) = [n!/(x!*(n-x)!)] * p^x * q^(n-x),    for x = 0, 1, 2, …, n

where

 p is the probability that the respective event will occur
 q is equal to 1-p
 n is the maximum number of independent trials.
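A direct transcription in Python, using the standard library’s math.comb for the n!/(x!*(n-x)!) term:

```python
from math import comb

def binomial_pmf(x, n, p):
    """P(X = x) successes in n independent trials with success prob p."""
    q = 1 - p
    return comb(n, x) * p ** x * q ** (n - x)

# E.g. probability of exactly 2 defective components in a sample of 20,
# when each component is defective with probability 0.05
prob = binomial_pmf(2, 20, 0.05)
```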

Cauchy Distribution. The Cauchy distribution is interesting for theoretical reasons. Although its mean can be taken as zero, since it is symmetrical about zero, the expectation, variance, higher moments, and moment generating function do not exist. The Cauchy distribution is defined as:

f(x) = 1/(θ*π*{1 + [(x-η)/θ]^2}),    for θ > 0

where

 η is the location parameter (median)
 θ is the scale parameter
 π is the constant Pi (3.1415…)

The animation above shows the changing shape of the Cauchy distribution when the location parameter equals 0 and the scale parameter equals 1, 2, 3, and 4.

Chi-square Distribution. The sum of n independent squared random variables, each distributed following the standard normal distribution, is distributed as Chi-square with n degrees of freedom. This distribution is most frequently used in the modeling of random variables (e.g., representing frequencies) in statistical applications. The Chi-square distribution is defined by:

f(x) = {1/[2^(ν/2) * Γ(ν/2)]} * x^((ν/2)-1) * e^(-x/2),    for ν = 1, 2, …, x > 0

where

 ν is the degrees of freedom
 e is the base of the natural logarithm, sometimes called Euler’s e (2.71…)
 Γ (gamma) is the Gamma function.

The above animation shows the shape of the Chi-square distribution as the degrees of freedom increase (1, 2, 5, 10, 25 and 50).

Exponential Distribution. If T is the time between occurrences of rare events that happen on the average with a rate λ per unit of time, then T is distributed exponentially with parameter λ (lambda). Thus, the exponential distribution is frequently used to model the time interval between successive random events. Examples of variables distributed in this manner would be the gap length between cars crossing an intersection, life-times of electronic devices, or arrivals of customers at the check-out counter in a grocery store. The exponential distribution function is defined as:

f(x) = λ*e^(-λx),    for 0 ≤ x < ∞, λ > 0

where

 λ is the rate parameter of the exponential distribution (an alternative parameterization uses the scale parameter b = 1/λ)
 e is the base of the natural logarithm, sometimes called Euler’s e (2.71…)
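As a sketch, the exponential density and inverse-transform sampling from it can be written as follows (pure Python; the numbers are illustrative):

```python
import math
import random

def exponential_pdf(x, lam):
    """Density lam * exp(-lam * x) for x >= 0, else 0."""
    return lam * math.exp(-lam * x) if x >= 0 else 0.0

def exponential_sample(lam, rng=random):
    """Inverse-transform sampling: -ln(1 - U)/lam for U ~ Uniform(0, 1)."""
    return -math.log(1.0 - rng.random()) / lam

# Simulated waiting times for events occurring at rate lam = 2 per unit time
random.seed(42)
lam = 2.0
waits = [exponential_sample(lam) for _ in range(10000)]
mean_wait = sum(waits) / len(waits)  # should be close to 1/lam = 0.5
```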

Extreme Value. The extreme value distribution is often used to model extreme events, such as the size of floods, gust velocities encountered by airplanes, maxima of stock market indices over a given year, etc.; it is also often used in reliability testing, for example in order to represent the distribution of failure times for electric circuits (see Hahn and Shapiro, 1967). The extreme value (Type I) distribution has the probability density function:

f(x) = 1/b * e^[-(x-a)/b] * e^{-e^[-(x-a)/b]},    for -∞ < x < ∞, b > 0

where

 a is the location parameter
 b is the scale parameter
 e is the base of the natural logarithm, sometimes called Euler’s e (2.71…)

F Distribution. Snedecor’s F distribution is most commonly used in tests of variance (e.g., ANOVA). The ratio of two chi-squares divided by their respective degrees of freedom is said to follow an F distribution. The F distribution (for x > 0) has the probability density function (for ν = 1, 2, …; ω = 1, 2, …):

f(x) = Γ[(ν+ω)/2]/[Γ(ν/2)Γ(ω/2)] * (ν/ω)^(ν/2) * x^((ν/2)-1) * [1 + (ν/ω)*x]^(-(ν+ω)/2),    for 0 ≤ x < ∞, ν = 1, 2, …, ω = 1, 2, …

where

 ν, ω are the shape parameters (degrees of freedom)
 Γ is the Gamma function

The animation above shows various tail areas (p-values) for an F distribution with both degrees of freedom equal to 10.

Gamma Distribution. The probability density function of the exponential distribution has a mode of zero. In many instances, it is known a priori that the mode of the distribution of a particular random variable of interest is not equal to zero (e.g., when modeling the distribution of the life-times of a product such as an electric light bulb, or the serving time taken at a ticket booth at a baseball game). In those cases, the gamma distribution is more appropriate for describing the underlying distribution. The gamma distribution is defined as:

f(x) = {1/[b*Γ(c)]} * (x/b)^(c-1) * e^(-x/b),    for 0 ≤ x, c > 0

where

 Γ is the Gamma function
 c is the shape parameter
 b is the scale parameter
 e is the base of the natural logarithm, sometimes called Euler’s e (2.71…)

The animation above shows the gamma distribution as the shape parameter changes from 1 to 6.
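A minimal Python sketch of this density (illustrative, function name mine). Note that with shape c = 1 the gamma density reduces to the exponential density e^(-x/b)/b, which connects it to the exponential distribution discussed above:

```python
import math

def gamma_pdf(x, c, b=1.0):
    """Gamma density with shape c and scale b: {1/[b*Gamma(c)]} * (x/b)**(c-1) * exp(-x/b)."""
    if x <= 0:
        return 0.0
    return (1.0 / (b * math.gamma(c))) * (x / b) ** (c - 1) * math.exp(-x / b)

# With c = 1 and b = 1 this is simply exp(-x).
```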

Geometric Distribution. If independent Bernoulli trials are made until a “success” occurs, then the total number of trials required is a geometric random variable. The geometric distribution is defined as:

f(x) = p*(1-p)^(x-1),    for x = 1,2,…

where

 p is the probability that a particular event (e.g., success) will occur.
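Since x counts the trials up to and including the first success, P(X = x) = p*(1-p)^(x-1). This can be checked by simulation; the Python sketch below (illustrative, names mine) compares the simulated frequency of a success on the very first trial with the formula:

```python
import random

def geometric_pmf(x, p):
    """P(first success occurs on trial x) = p * (1-p)**(x-1), for x = 1, 2, ..."""
    return p * (1.0 - p) ** (x - 1)

random.seed(1)

def trials_until_success(p):
    """Run Bernoulli trials with success probability p; return the trial count."""
    n = 1
    while random.random() >= p:
        n += 1
    return n

p = 0.3
sims = [trials_until_success(p) for _ in range(20000)]
freq1 = sims.count(1) / len(sims)   # should be close to geometric_pmf(1, p) = 0.3
```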

Gompertz Distribution. The Gompertz distribution is a theoretical distribution of survival times. Gompertz (1825) proposed a probability model for human mortality, based on the assumption that the “average exhaustion of a man’s power to avoid death to be such that at the end of equal infinitely small intervals of time he lost equal portions of his remaining power to oppose destruction which he had at the commencement of these intervals” (Johnson, Kotz, Balakrishnan, 1995, p. 25). The resultant hazard function:

r(x) = B*c^x,    for x ≥ 0, B > 0, c ≥ 1

is often used in survival analysis. See Johnson, Kotz, Balakrishnan (1995) for additional details.

Laplace Distribution. For interesting mathematical applications of the Laplace distribution see Johnson and Kotz (1995). The Laplace (or Double Exponential) distribution is defined as:

f(x) = 1/(2b) * e^(-|x-a|/b),    for -∞ < x < ∞, b > 0

where

 a is the location parameter (mean)
 b is the scale parameter
 e is the base of the natural logarithm, sometimes called Euler’s e (2.71…)

The graphic above shows the changing shape of the Laplace distribution when the location parameter equals 0 and the scale parameter equals 1, 2, 3, and 4.
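The double-exponential shape is easy to verify directly; this short Python sketch (illustrative, function name mine) shows the symmetry about the location parameter and the peak height 1/(2b):

```python
import math

def laplace_pdf(x, a=0.0, b=1.0):
    """Laplace (double exponential) density: (1/(2b)) * exp(-|x-a|/b)."""
    return (1.0 / (2.0 * b)) * math.exp(-abs(x - a) / b)

# The density is symmetric about a, with maximum value 1/(2b) at x = a.
```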

Logistic Distribution. The logistic distribution is used to model binary responses (e.g., Gender) and is commonly used in logistic regression. The logistic distribution is defined as:

f(x) = (1/b) * e^[-(x-a)/b] * {1 + e^[-(x-a)/b]}^(-2),    for -∞ < x < ∞, b > 0

where

 a is the location parameter (mean)
 b is the scale parameter
 e is the base of the natural logarithm, sometimes called Euler’s e (2.71…)

The graphic above shows the changing shape of the logistic distribution when the location parameter equals 0 and the scale parameter equals 1, 2, and 3.

Log-normal Distribution. The log-normal distribution is often used in simulations of variables such as personal incomes, age at first marriage, or tolerance to poison in animals. In general, if x is a sample from a normal distribution, then y = ex is a sample from a log-normal distribution. Thus, the log-normal distribution is defined as:

f(x) = 1/[x*σ*(2π)^(1/2)] * e^{-[log(x)-µ]²/(2σ²)},    for 0 < x < ∞, σ > 0

where

 µ is the scale parameter
 σ is the shape parameter
 e is the base of the natural logarithm, sometimes called Euler’s e (2.71…)

The animation above shows the log-normal distribution with µ equal to 0 and σ equal to .10, .30, .50, .70, and .90.

Normal Distribution. The normal distribution (the “bell-shaped curve” which is symmetrical about the mean) is a theoretical function commonly used in inferential statistics as an approximation to sampling distributions (see also Elementary Concepts). In general, the normal distribution provides a good model for a random variable, when:

1. There is a strong tendency for the variable to take a central value;
2. Positive and negative deviations from this central value are equally likely;
3. The frequency of deviations falls off rapidly as the deviations become larger.

As an underlying mechanism that produces the normal distribution, we can think of an infinite number of independent random (binomial) events that bring about the values of a particular variable. For example, there are probably a nearly infinite number of factors that determine a person’s height (thousands of genes, nutrition, diseases, etc.). Thus, height can be expected to be normally distributed in the population. The normal distribution function is determined by the following formula:

f(x) = 1/[(2π)^(1/2)*σ] * e^{-(1/2)*[(x-µ)/σ]²},    for -∞ < x < ∞

where

 µ is the mean
 σ is the standard deviation
 e is the base of the natural logarithm, sometimes called Euler’s e (2.71…)
 π is the constant Pi (3.14…)

The animation above shows several tail areas of the standard normal distribution (i.e., the normal distribution with a mean of 0 and a standard deviation of 1). The standard normal distribution is often used in hypothesis testing.
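Tail areas of the standard normal distribution can be computed from the error function; the Python sketch below (illustrative, function names mine) evaluates the density and the familiar upper tail beyond z = 1.96:

```python
import math

def normal_pdf(x, mu=0.0, sigma=1.0):
    """Normal density: (1/(sqrt(2*pi)*sigma)) * exp(-0.5*((x-mu)/sigma)**2)."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (math.sqrt(2.0 * math.pi) * sigma)

def normal_cdf(x, mu=0.0, sigma=1.0):
    """Cumulative normal probability via the error function."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

upper_tail = 1.0 - normal_cdf(1.96)   # upper tail area beyond z = 1.96, about 0.025
```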

Pareto Distribution. The Pareto distribution is commonly used in monitoring production processes (see Quality Control and Process Analysis). For example, a machine which produces copper wire will occasionally generate a flaw at some point along the wire. The Pareto distribution can be used to model the length of wire between successive flaws. The standard Pareto distribution is defined as:

f(x) = c/x^(c+1),    for 1 ≤ x, c > 0

where

 c is the shape parameter

The animation above shows the Pareto distribution for the shape parameter equal to 1, 2, 3, 4, and 5.

Poisson Distribution. The Poisson distribution is also sometimes referred to as the distribution of rare events. Examples of Poisson distributed variables are number of accidents per person, number of sweepstakes won per person, or the number of catastrophic defects found in a production process. It is defined as:

f(x) = (λ^x * e^(-λ))/x!,    for x = 0,1,2,…, 0 < λ

where

 λ (lambda) is the expected value of x (the mean)
 e is the base of the natural logarithm, sometimes called Euler’s e (2.71…)
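The mass function is straightforward to evaluate; this Python sketch (illustrative, function name mine) also confirms numerically that the mean of the distribution equals λ:

```python
import math

def poisson_pmf(x, lam):
    """P(X = x) = lam**x * exp(-lam) / x!, for x = 0, 1, 2, ..."""
    return lam ** x * math.exp(-lam) / math.factorial(x)

# The expected value sum(k * P(X = k)) should recover lambda itself.
mean = sum(k * poisson_pmf(k, 3.5) for k in range(100))
```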

Rayleigh Distribution. If two variables y1 and y2 are independent of each other and normally distributed with equal variance, then the variable x = √(y1² + y2²) will follow the Rayleigh distribution. Thus, an example (and appropriate metaphor) for such a variable would be the distance of darts from the target in a dart-throwing game, where the errors in the two dimensions of the target plane are independent and normally distributed. The Rayleigh distribution is defined as:

f(x) = x/b² * e^[-x²/(2b²)],    for 0 ≤ x < ∞, b > 0

where

 b is the scale parameter
 e is the base of the natural logarithm, sometimes called Euler’s e (2.71…)

The graphic above shows the changing shape of the Rayleigh distribution when the scale parameter equals 1, 2, and 3.
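The defining construction x = √(y1² + y2²) can be checked by simulation; the Python sketch below (illustrative, names mine) draws pairs of independent standard normals and compares the sample mean with the theoretical Rayleigh mean √(π/2) for b = 1:

```python
import math
import random

def rayleigh_pdf(x, b=1.0):
    """Rayleigh density: (x/b**2) * exp(-x**2 / (2*b**2)), for x >= 0."""
    if x < 0:
        return 0.0
    return (x / (b * b)) * math.exp(-x * x / (2.0 * b * b))

random.seed(7)
# x = sqrt(y1**2 + y2**2) for independent standard normals y1, y2 is Rayleigh(b=1).
samples = [math.hypot(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(20000)]
sample_mean = sum(samples) / len(samples)
theoretical_mean = math.sqrt(math.pi / 2.0)   # E[X] for b = 1
```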

Rectangular Distribution. The rectangular (uniform) distribution is useful for describing random variables with a constant probability density over the defined range a < x < b.

f(x) = 1/(b-a),    for a < x < b
     = 0,          elsewhere

where

 a, b are the lower and upper limits of the range (a < b)

Student’s t Distribution. The student’s t distribution is symmetric about zero, and its general shape is similar to that of the standard normal distribution. It is most commonly used in testing hypotheses about the mean of a particular population. The student’s t distribution is defined as (for ν = 1, 2, …):

f(x) = Γ[(ν+1)/2] / [Γ(ν/2) * (ν*π)^(1/2)] * [1 + x²/ν]^[-(ν+1)/2],    for -∞ < x < ∞

where

 ν is the shape parameter (degrees of freedom)
 Γ is the Gamma function
 π is the constant Pi (3.14…)

The shape of the student’s t distribution is determined by the degrees of freedom. As shown in the animation above, its shape changes as the degrees of freedom increase.

Weibull Distribution. As described earlier, the exponential distribution is often used as a model of time-to-failure measurements, when the failure (hazard) rate is constant over time. When the failure probability varies over time, then the Weibull distribution is appropriate. Thus, the Weibull distribution is often used in reliability testing (e.g., of electronic relays, ball bearings, etc.; see Hahn and Shapiro, 1967). The Weibull distribution is defined as:

f(x) = (c/b) * (x/b)^(c-1) * e^[-(x/b)^c],    for 0 ≤ x < ∞, b > 0, c > 0

where

 b is the scale parameter
 c is the shape parameter
 e is the base of the natural logarithm, sometimes called Euler’s e (2.71…)

The animation above shows the Weibull distribution as the shape parameter increases (.5, 1, 2, 3, 4, 5, and 10).
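A minimal Python sketch of this density (illustrative, function name mine). As noted above, with shape c = 1 the Weibull reduces to the exponential distribution with constant hazard:

```python
import math

def weibull_pdf(x, c, b=1.0):
    """Weibull density: (c/b) * (x/b)**(c-1) * exp(-(x/b)**c), for x >= 0."""
    if x < 0:
        return 0.0
    return (c / b) * (x / b) ** (c - 1) * math.exp(-((x / b) ** c))

# With c = 1 and b = 1 this is the exponential density exp(-x).
```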

Discover Which Variables Discriminate Between Groups, Discriminant Function Analysis

General Purpose

Discriminant function analysis is used to determine which variables discriminate between two or more naturally occurring groups. For example, an educational researcher may want to investigate which variables discriminate between high school graduates who decide (1) to go to college, (2) to attend a trade or professional school, or (3) to seek no further training or education. For that purpose the researcher could collect data on numerous variables prior to students’ graduation. After graduation, most students will naturally fall into one of the three categories. Discriminant Analysis could then be used to determine which variable(s) are the best predictors of students’ subsequent educational choice.

A medical researcher may record different variables relating to patients’ backgrounds in order to learn which variables best predict whether a patient is likely to recover completely (group 1), partially (group 2), or not at all (group 3). A biologist could record different characteristics of similar types (groups) of flowers, and then perform a discriminant function analysis to determine the set of characteristics that allows for the best discrimination between the types.

Computational Approach

Computationally, discriminant function analysis is very similar to analysis of variance (ANOVA). Let us consider a simple example. Suppose we measure height in a random sample of 50 males and 50 females. Females are, on the average, not as tall as males, and this difference will be reflected in the difference in means (for the variable Height). Therefore, variable height allows us to discriminate between males and females with a better than chance probability: if a person is tall, then he is likely to be a male, if a person is short, then she is likely to be a female.

We can generalize this reasoning to groups and variables that are less “trivial.” For example, suppose we have two groups of high school graduates: Those who choose to attend college after graduation and those who do not. We could have measured students’ stated intention to continue on to college one year prior to graduation. If the means for the two groups (those who actually went to college and those who did not) are different, then we can say that intention to attend college as stated one year prior to graduation allows us to discriminate between those who are and are not college bound (and this information may be used by career counselors to provide the appropriate guidance to the respective students).

To summarize the discussion so far, the basic idea underlying discriminant function analysis is to determine whether groups differ with regard to the mean of a variable, and then to use that variable to predict group membership (e.g., of new cases).

Analysis of Variance. Stated in this manner, the discriminant function problem can be rephrased as a one-way analysis of variance (ANOVA) problem. Specifically, one can ask whether or not two or more groups are significantly different from each other with respect to the mean of a particular variable. To learn more about how one can test for the statistical significance of differences between means in different groups you may want to read the Overview section to ANOVA/MANOVA. However, it should be clear that, if the means for a variable are significantly different in different groups, then we can say that this variable discriminates between the groups.

In the case of a single variable, the final significance test of whether or not a variable discriminates between groups is the F test. As described in Elementary Concepts and ANOVA/MANOVA, F is essentially computed as the ratio of the between-groups variance in the data over the pooled (average) within-group variance. If the between-group variance is significantly larger, then there must be significant differences between means.

Multiple Variables. Usually, one includes several variables in a study in order to see which one(s) contribute to the discrimination between groups. In that case, we have a matrix of total variances and covariances; likewise, we have a matrix of pooled within-group variances and covariances. We can compare those two matrices via multivariate F tests in order to determine whether or not there are any significant differences (with regard to all variables) between groups. This procedure is identical to multivariate analysis of variance or MANOVA. As in MANOVA, one could first perform the multivariate test, and, if statistically significant, proceed to see which of the variables have significantly different means across the groups. Thus, even though the computations with multiple variables are more complex, the principal reasoning still applies, namely, that we are looking for variables that discriminate between groups, as evident in observed mean differences.

Stepwise Discriminant Analysis

Probably the most common application of discriminant function analysis is to include many measures in the study, in order to determine the ones that discriminate between groups. For example, an educational researcher interested in predicting high school graduates’ choices for further education would probably include as many measures of personality, achievement motivation, academic performance, etc. as possible in order to learn which one(s) offer the best prediction.

Model. Put another way, we want to build a “model” of how we can best predict to which group a case belongs. In the following discussion we will use the term “in the model” in order to refer to variables that are included in the prediction of group membership, and we will refer to variables as being “not in the model” if they are not included.

Forward stepwise analysis. In stepwise discriminant function analysis, a model of discrimination is built step-by-step. Specifically, at each step all variables are reviewed and evaluated to determine which one will contribute most to the discrimination between groups. That variable will then be included in the model, and the process starts again.

Backward stepwise analysis. One can also step backwards; in that case all variables are included in the model and then, at each step, the variable that contributes least to the prediction of group membership is eliminated. Thus, as the result of a successful discriminant function analysis, one would only keep the “important” variables in the model, that is, those variables that contribute the most to the discrimination between groups.

F to enter, F to remove. The stepwise procedure is “guided” by the respective F to enter and F to remove values. The F value for a variable indicates its statistical significance in the discrimination between groups, that is, it is a measure of the extent to which a variable makes a unique contribution to the prediction of group membership. If you are familiar with stepwise multiple regression procedures, then you may interpret the F to enter/remove values in the same way as in stepwise regression.

Capitalizing on chance. A common misinterpretation of the results of stepwise discriminant analysis is to take statistical significance levels at face value. By nature, the stepwise procedures will capitalize on chance because they “pick and choose” the variables to be included in the model so as to yield maximum discrimination. Thus, when using the stepwise approach the researcher should be aware that the significance levels do not reflect the true alpha error rate, that is, the probability of erroneously rejecting H0 (the null hypothesis that there is no discrimination between groups).

Interpreting a Two-Group Discriminant Function

In the two-group case, discriminant function analysis can also be thought of as (and is analogous to) multiple regression (see Multiple Regression; the two-group discriminant analysis is also called Fisher linear discriminant analysis after Fisher, 1936; computationally all of these approaches are analogous). If we code the two groups in the analysis as 1 and 2, and use that variable as the dependent variable in a multiple regression analysis, then we would get results that are analogous to those we would obtain via Discriminant Analysis. In general, in the two-group case we fit a linear equation of the type:

Group = a + b1*x1 + b2*x2 + … + bm*xm

where a is a constant and b1 through bm are regression coefficients. The interpretation of the results of a two-group problem is straightforward and closely follows the logic of multiple regression: Those variables with the largest (standardized) regression coefficients are the ones that contribute most to the prediction of group membership.

Discriminant Functions for Multiple Groups

When there are more than two groups, then we can estimate more than one discriminant function like the one presented above. For example, when there are three groups, we could estimate (1) a function for discriminating between group 1 and groups 2 and 3 combined, and (2) another function for discriminating between group 2 and group 3. For example, we could have one function that discriminates between those high school graduates that go to college and those who do not (but rather get a job or go to a professional or trade school), and a second function to discriminate between those graduates that go to a professional or trade school versus those who get a job. The b coefficients in those discriminant functions could then be interpreted as before.

Canonical analysis. When actually performing a multiple group discriminant analysis, we do not have to specify how to combine groups so as to form different discriminant functions. Rather, you can automatically determine some optimal combination of variables so that the first function provides the most overall discrimination between groups, the second provides second most, and so on. Moreover, the functions will be independent or orthogonal, that is, their contributions to the discrimination between groups will not overlap. Computationally, you will perform a canonical correlation analysis (see also Canonical Correlation) that will determine the successive functions and canonical roots (the term root refers to the eigenvalues that are associated with the respective canonical function). The maximum number of functions will be equal to the number of groups minus one, or the number of variables in the analysis, whichever is smaller.

Interpreting the discriminant functions. As before, we will get b (and standardized beta) coefficients for each variable in each discriminant (now also called canonical) function, and they can be interpreted as usual: the larger the standardized coefficient, the greater is the contribution of the respective variable to the discrimination between groups. (Note that we could also interpret the structure coefficients; see below.) However, these coefficients do not tell us between which of the groups the respective functions discriminate. We can identify the nature of the discrimination for each discriminant (canonical) function by looking at the means for the functions across groups. We can also visualize how the two functions discriminate between groups by plotting the individual scores for the two discriminant functions (see the example graph below).

In this example, Root (function) 1 seems to discriminate mostly between groups Setosa, and Virginic and Versicol combined. In the vertical direction (Root 2), a slight trend of Versicol points to fall below the center line (0) is apparent.

Factor structure matrix. Another way to determine which variables “mark” or define a particular discriminant function is to look at the factor structure. The factor structure coefficients are the correlations between the variables in the model and the discriminant functions; if you are familiar with factor analysis (see Factor Analysis) you may think of these correlations as factor loadings of the variables on each discriminant function.

Some authors have argued that these structure coefficients should be used when interpreting the substantive “meaning” of discriminant functions. The reasons given by those authors are that (1) supposedly the structure coefficients are more stable, and (2) they allow for the interpretation of factors (discriminant functions) in the manner that is analogous to factor analysis. However, subsequent Monte Carlo research (Barcikowski & Stevens, 1975; Huberty, 1975) has shown that the discriminant function coefficients and the structure coefficients are about equally unstable, unless the n is fairly large (e.g., if there are 20 times more cases than there are variables). The most important thing to remember is that the discriminant function coefficients denote the unique (partial) contribution of each variable to the discriminant function(s), while the structure coefficients denote the simple correlations between the variables and the function(s). If one wants to assign substantive “meaningful” labels to the discriminant functions (akin to the interpretation of factors in factor analysis), then the structure coefficients should be used (interpreted); if one wants to learn what is each variable’s unique contribution to the discriminant function, use the discriminant function coefficients (weights).

Significance of discriminant functions. One can test the number of roots that add significantly to the discrimination between groups. Only those found to be statistically significant should be used for interpretation; non-significant functions (roots) should be ignored.

Summary. To summarize, when interpreting multiple discriminant functions, which arise from analyses with more than two groups and more than one variable, one would first test the different functions for statistical significance, and only consider the significant functions for further examination. Next, we would look at the standardized b coefficients for each variable for each significant function. The larger the standardized b coefficient, the larger is the respective variable’s unique contribution to the discrimination specified by the respective discriminant function. In order to derive substantive “meaningful” labels for the discriminant functions, one can also examine the factor structure matrix with the correlations between the variables and the discriminant functions. Finally, we would look at the means for the significant discriminant functions in order to determine between which groups the respective functions seem to discriminate.

Assumptions

As mentioned earlier, discriminant function analysis is computationally very similar to MANOVA, and all assumptions for MANOVA mentioned in ANOVA/MANOVA apply. In fact, you may use the wide range of diagnostics and statistical tests of assumption that are available to examine your data for the discriminant analysis.

Normal distribution. It is assumed that the data (for the variables) represent a sample from a multivariate normal distribution. You can examine whether or not variables are normally distributed with histograms of frequency distributions. However, note that violations of the normality assumption are usually not “fatal,” meaning that the resultant significance tests are still “trustworthy.” You may use specific tests for normality in addition to graphs.

Homogeneity of variances/covariances. It is assumed that the variance/covariance matrices of variables are homogeneous across groups. Again, minor deviations are not that important; however, before accepting final conclusions for an important study it is probably a good idea to review the within-groups variances and correlation matrices. In particular a scatterplot matrix can be produced and can be very useful for this purpose. When in doubt, try re-running the analyses excluding one or two groups that are of less interest. If the overall results (interpretations) hold up, you probably do not have a problem. You may also use the numerous tests available to examine whether or not this assumption is violated in your data. However, as mentioned in ANOVA/MANOVA, the multivariate Box M test for homogeneity of variances/covariances is particularly sensitive to deviations from multivariate normality, and should not be taken too “seriously.”

Correlations between means and variances. The major “real” threat to the validity of significance tests occurs when the means for variables across groups are correlated with the variances (or standard deviations). Intuitively, if there is large variability in a group with particularly high means on some variables, then those high means are not reliable. However, the overall significance tests are based on pooled variances, that is, the average variance across all groups. Thus, the significance tests of the relatively larger means (with the large variances) would be based on the relatively smaller pooled variances, resulting erroneously in statistical significance. In practice, this pattern may occur if one group in the study contains a few extreme outliers, who have a large impact on the means, and also increase the variability. To guard against this problem, inspect the descriptive statistics, that is, the means and standard deviations or variances for such a correlation.

The matrix ill-conditioning problem. Another assumption of discriminant function analysis is that the variables that are used to discriminate between groups are not completely redundant. As part of the computations involved in discriminant analysis, you will invert the variance/covariance matrix of the variables in the model. If any one of the variables is completely redundant with the other variables then the matrix is said to be ill-conditioned, and it cannot be inverted. For example, if a variable is the sum of three other variables that are also in the model, then the matrix is ill-conditioned.

Tolerance values. In order to guard against matrix ill-conditioning, constantly check the so-called tolerance value for each variable. This tolerance value is computed as 1 minus R-square of the respective variable with all other variables included in the current model. Thus, it is the proportion of variance that is unique to the respective variable. You may also refer to Multiple Regression to learn more about multiple regression and the interpretation of the tolerance value. In general, when a variable is almost completely redundant (and, therefore, the matrix ill-conditioning problem is likely to occur), the tolerance value for that variable will approach 0.
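As an illustration of the tolerance value, the sketch below (plain Python, all names and data are mine and purely illustrative) regresses a variable that is an exact sum of two others on those two predictors; its R-square is then 1, so its tolerance, 1 minus R-square, approaches 0, signalling the ill-conditioning problem:

```python
def r_squared(y, X):
    """R**2 from regressing y on the columns of X (with intercept), via normal equations."""
    n, k = len(y), len(X[0])
    A = [[1.0] + list(row) for row in X]            # add intercept column
    m = k + 1
    # Build A'A and A'y, then solve (A'A) beta = A'y by Gaussian elimination with pivoting.
    ata = [[sum(A[r][i] * A[r][j] for r in range(n)) for j in range(m)] for i in range(m)]
    aty = [sum(A[r][i] * y[r] for r in range(n)) for i in range(m)]
    for col in range(m):
        piv = max(range(col, m), key=lambda r: abs(ata[r][col]))
        ata[col], ata[piv] = ata[piv], ata[col]
        aty[col], aty[piv] = aty[piv], aty[col]
        for r in range(col + 1, m):
            f = ata[r][col] / ata[col][col]
            for c2 in range(col, m):
                ata[r][c2] -= f * ata[col][c2]
            aty[r] -= f * aty[col]
    beta = [0.0] * m
    for r in range(m - 1, -1, -1):
        beta[r] = (aty[r] - sum(ata[r][c2] * beta[c2] for c2 in range(r + 1, m))) / ata[r][r]
    fitted = [sum(a * b for a, b in zip(row, beta)) for row in A]
    ybar = sum(y) / n
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - ybar) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot

# x3 is an exact linear combination of x1 and x2, so its tolerance should approach 0.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
x2 = [2.0, 1.0, 4.0, 3.0, 6.0, 5.0]
x3 = [a + b for a, b in zip(x1, x2)]
tolerance = 1.0 - r_squared(x3, [[a, b] for a, b in zip(x1, x2)])
```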

Classification

Another major purpose to which discriminant analysis is applied is the issue of predictive classification of cases. Once a model has been finalized and the discriminant functions have been derived, how well can we predict to which group a particular case belongs?

A priori and post hoc predictions. Before going into the details of different estimation procedures, we would like to make sure that this difference is clear. Obviously, if we estimate, based on some data set, the discriminant functions that best discriminate between groups, and then use the same data to evaluate how accurate our prediction is, then we are very much capitalizing on chance. In general, one will always get a worse classification when predicting cases that were not used for the estimation of the discriminant function. Put another way, post hoc predictions are always better than a priori predictions. (The trouble with predicting the future a priori is that one does not know what will happen; it is much easier to find ways to predict what we already know has happened.) Therefore, one should never base one’s confidence regarding the correct classification of future observations on the same data set from which the discriminant functions were derived; rather, if one wants to classify cases predictively, it is necessary to collect new data to “try out” (cross-validate) the utility of the discriminant functions.

Classification functions. These are not to be confused with the discriminant functions. The classification functions can be used to determine to which group each case most likely belongs. There are as many classification functions as there are groups. Each function allows us to compute classification scores for each case for each group, by applying the formula:

Si = ci + wi1*x1 + wi2*x2 + … + wim*xm

In this formula, the subscript i denotes the respective group; the subscripts 1, 2, …, m denote the m variables; ci is a constant for the i‘th group, wij is the weight for the j‘th variable in the computation of the classification score for the i‘th group; xj is the observed value for the respective case for the j‘th variable. Si is the resultant classification score.

We can use the classification functions to directly compute classification scores for some new observations.
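For example, the scoring rule S_i = c_i + w_i1*x_1 + … + w_im*x_m can be sketched as follows. The coefficients below are purely hypothetical placeholders (in practice they come out of the discriminant analysis itself), chosen only to show the mechanics of scoring and assigning a new case:

```python
# Hypothetical classification-function coefficients (c_i and weights w_ij) for three
# groups and two predictor variables; these values are illustrative, not estimated.
CLASSIFICATION_FUNCTIONS = {
    "college":      {"c": -12.0, "w": [0.8, 1.5]},
    "trade_school": {"c":  -7.5, "w": [0.6, 0.9]},
    "job":          {"c":  -4.0, "w": [0.3, 0.5]},
}

def classification_scores(x):
    """S_i = c_i + w_i1*x_1 + ... + w_im*x_m for each group i."""
    return {group: f["c"] + sum(w * xj for w, xj in zip(f["w"], x))
            for group, f in CLASSIFICATION_FUNCTIONS.items()}

def classify(x):
    """Assign the case to the group with the highest classification score."""
    scores = classification_scores(x)
    return max(scores, key=scores.get)

predicted = classify([9.0, 7.0])   # a new case with two predictor values
```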

Classification of cases. Once we have computed the classification scores for a case, it is easy to decide how to classify the case: in general we classify the case as belonging to the group for which it has the highest classification score (unless the a priori classification probabilities are widely disparate; see below). Thus, if we were to study high school students’ post-graduation career/educational choices (e.g., attending college, attending a professional or trade school, or getting a job) based on several variables assessed one year prior to graduation, we could use the classification functions to predict what each student is most likely to do after graduation. However, we would also like to know the probability that the student will make the predicted choice. Those probabilities are called posterior probabilities, and can also be computed. However, to understand how those probabilities are derived, let us first consider the so-called Mahalanobis distances.

Mahalanobis distances. You may have read about these distances in other parts of the manual. In general, the Mahalanobis distance is a measure of distance between two points in the space defined by two or more correlated variables. For example, if there are two variables that are uncorrelated, then we could plot points (cases) in a standard two-dimensional scatterplot; the Mahalanobis distances between the points would then be identical to the Euclidean distance; that is, the distance as, for example, measured by a ruler. If there are three uncorrelated variables, we could also simply use a ruler (in a 3-D plot) to determine the distances between points. If there are more than 3 variables, we cannot represent the distances in a plot any more. Also, when the variables are correlated, then the axes in the plots can be thought of as being non-orthogonal; that is, they would not be positioned in right angles to each other. In those cases, the simple Euclidean distance is not an appropriate measure, while the Mahalanobis distance will adequately account for the correlations.

Mahalanobis distances and classification. For each group in our sample, we can determine the location of the point that represents the means for all variables in the multivariate space defined by the variables in the model. These points are called group centroids. For each case we can then compute the Mahalanobis distances (of the respective case) from each of the group centroids. Again, we would classify the case as belonging to the group to which it is closest, that is, where the Mahalanobis distance is smallest.
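A sketch of this rule for the two-variable case follows (Python, illustrative only; the centroids and the pooled covariance matrix are made-up values, not estimates from real data). With an identity covariance matrix the Mahalanobis distance reduces to the ordinary Euclidean distance, as described above:

```python
def mahalanobis2(x, centroid, cov):
    """Squared Mahalanobis distance (x - m)' * inv(cov) * (x - m) for the 2x2 case."""
    dx, dy = x[0] - centroid[0], x[1] - centroid[1]
    (a, b), (c, d) = cov
    det = a * d - b * c
    inv = ((d / det, -b / det), (-c / det, a / det))   # explicit 2x2 inverse
    return (dx * (inv[0][0] * dx + inv[0][1] * dy)
            + dy * (inv[1][0] * dx + inv[1][1] * dy))

# Illustrative group centroids and a pooled covariance for two correlated variables.
centroids = {"group1": (0.0, 0.0), "group2": (3.0, 3.0)}
cov = ((1.0, 0.8), (0.8, 1.0))

def nearest_group(x):
    """Classify a case into the group whose centroid is closest in Mahalanobis distance."""
    return min(centroids, key=lambda g: mahalanobis2(x, centroids[g], cov))
```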

Posterior classification probabilities. Using the Mahalanobis distances to do the classification, we can now derive probabilities. The probability that a case belongs to a particular group is basically proportional to the Mahalanobis distance from that group centroid (it is not exactly proportional because we assume a multivariate normal distribution around each centroid). Because we compute the location of each case from our prior knowledge of the values for that case on the variables in the model, these probabilities are called posterior probabilities. In summary, the posterior probability is the probability, based on our knowledge of the values of other variables, that the respective case belongs to a particular group. Some software packages will automatically compute those probabilities for all cases (or for selected cases only for cross-validation studies).

A priori classification probabilities. There is one additional factor that needs to be considered when classifying cases. Sometimes, we know ahead of time that there are more observations in one group than in any other; thus, the a priori probability that a case belongs to that group is higher. For example, if we know ahead of time that 60% of the graduates from our high school usually go to college (20% go to a professional school, and another 20% get a job), then we should adjust our prediction accordingly: a priori, and all other things being equal, it is more likely that a student will attend college than choose either of the other two options. You can specify different a priori probabilities, which will then be used to adjust the classification of cases (and the computation of posterior probabilities) accordingly.

In practice, the researcher needs to ask him or herself whether the unequal number of cases in different groups in the sample is a reflection of the true distribution in the population, or whether it is only the (random) result of the sampling procedure. In the former case, we would set the a priori probabilities to be proportional to the sizes of the groups in our sample; in the latter case, we would specify the a priori probabilities as being equal in each group. The specification of different a priori probabilities can greatly affect the accuracy of the prediction.
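Putting the last two sections together, here is a hedged sketch of how a priori probabilities adjust the posteriors (assuming multivariate-normal groups with a common covariance matrix; the distances and prior values below are invented for illustration):

```python
# Sketch: posterior probability of group k taken proportional to
#     prior_k * exp(-d_k^2 / 2)
# where d_k is the case's Mahalanobis distance from group k's centroid.
# All numbers are made up; this is not STATISTICA's code.
import math

def posteriors(distances, priors):
    scores = {g: priors[g] * math.exp(-d * d / 2.0) for g, d in distances.items()}
    total = sum(scores.values())
    return {g: s / total for g, s in scores.items()}

d = {"college": 1.0, "professional": 2.0, "job": 2.5}
flat   = posteriors(d, {"college": 1/3, "professional": 1/3, "job": 1/3})
skewed = posteriors(d, {"college": 0.6, "professional": 0.2, "job": 0.2})
print(flat["college"], skewed["college"])   # the 60% prior pulls the posterior up
```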

Summary of the prediction. A common result that one looks at in order to determine how well the current classification functions predict group membership of cases is the classification matrix. The classification matrix shows the number of cases that were correctly classified (on the diagonal of the matrix) and those that were misclassified.
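A minimal illustration of such a classification matrix, built in plain Python from invented observed and predicted labels:

```python
# Sketch: a classification (confusion) matrix from observed vs. predicted
# group labels. The labels below are invented for illustration.
from collections import Counter

observed  = ["A", "A", "A", "B", "B", "B", "B", "A"]
predicted = ["A", "A", "B", "B", "B", "A", "B", "A"]

labels = sorted(set(observed))
matrix = Counter(zip(observed, predicted))

# rows = observed group, columns = predicted group;
# correctly classified cases sit on the diagonal
for obs in labels:
    print(obs, [matrix[(obs, pred)] for pred in labels])

correct = sum(matrix[(g, g)] for g in labels)
print("accuracy:", correct / len(observed))
```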

Another word of caution. To reiterate, post hoc predicting of what has happened in the past is not that difficult. It is not uncommon to obtain very good classification if one uses the same cases from which the classification functions were computed. In order to get an idea of how well the current classification functions “perform,” one must classify (a priori) different cases, that is, cases that were not used to estimate the classification functions. You can include or exclude cases from the computations; thus, the classification matrix can be computed for “old” cases as well as “new” cases. Only the classification of new cases allows us to assess the predictive validity of the classification functions (see also cross-validation); the classification of old cases only provides a useful diagnostic tool to identify outliers or areas where the classification function seems to be less adequate.

Summary. In general, Discriminant Analysis is a very useful tool (1) for detecting the variables that allow the researcher to discriminate between different (naturally occurring) groups, and (2) for classifying cases into different groups with better-than-chance accuracy.

Dear STATISTICA user,

• STATISTICA v10 is a comprehensive, integrated data analysis, graphics, database management, & custom application development system featuring a wide selection of basic & advanced analytic procedures for business, data mining, science, & engineering applications.
• State-of-the-art software is a significant investment & should be kept up to date; users should receive the latest training to ensure the maximum benefit from that investment.
• Professional training by experts in statistics, methodology & practical applications, as well as in “tips & tricks”, can greatly enhance user productivity. Expert training guides users & prospective users through the wealth of functionality the program offers, streamlining your workflow & saving both time & money.
• Our training services help improve analytical skills, whether you are a beginner or an advanced user. StatSoft educators represent the world’s finest analytic expertise, drawn from industry & academic institutions.
• Our 2010 training encompasses the latest time- & cost-saving techniques, keeping you abreast of current methods & world trends, & bringing value to your organization.

Getting started with Statistica, Tutorials, Popular Videos!

No need to feel lost getting started with STATISTICA! We’ve got you covered with our popular videos on text mining, data mining, and all things analytic.

Video Tutorials

Introductory Overview

Welcome to STATISTICA, where every analysis you will ever need is at your fingertips. Used around the world in at least 30 countries, StatSoft’s STATISTICA line of software has gained unprecedented recognition by users and reviewers. In addition to both basic and advanced statistics, STATISTICA products offer specialized tools for analyzing neural networks, determining sample size, designing experiments, creating real-time quality control charts, reporting via the Web, and much more . . . the possibilities are endless.

Video Title
Use the Analysis Toolbar
In this demonstration, see the benefits of convenient multi-tasking functionality in STATISTICA. Run multiple copies of STATISTICA at the same time, run multiple analyses of the same or different kinds, run analyses on the same or different data files, or do all three.
Save and Retrieve Projects
STATISTICA Projects provide the means to save your work and return to it later. A project is a “snapshot” of STATISTICA at the time it was saved: input data, results including graphs, spreadsheets, workbooks and reports, and data miner workspaces. This tutorial explains how projects are used.
Use Variable Bundles
With the Variable Bundles Manager, you can easily create bundles of variables in order to organize large sets of variables and to facilitate the repeated selection of the same set of variables. By creating bundles, you can quickly and easily locate a subset of data in a large data file.
Perform By Group Analysis
With STATISTICA, you can generate output for each unique level of a By variable or unique combination of multiple By variables at the individual results level. This makes it very easy to compare results of an analysis across different groups.
Select Subsets of Cases
In this demonstration, see the extremely flexible facilities for case selection provided in STATISTICA. You can specify cases in two different ways, either temporarily, only for the duration of a single analysis, or more permanently for all subsequent analyses using the current spreadsheet.
Data Filtering/Cleaning
Data Cleaning is an important first step in Data Mining and general analysis projects.  This tutorial illustrates several of the data cleaning tools of STATISTICA.
Spreadsheet Formulas
Variables in STATISTICA Spreadsheets can be defined by formulas that support a wide selection of mathematical, logical, and statistical functions. Furthermore, STATISTICA provides the option to automatically or manually recalculate all spreadsheet functions as the data change. In this demonstration, see how STATISTICA’s “type ahead” feature recognizes functions and prompts for the necessary parameters.
Select Output
What is your preference for showing the results of your analyses? See how the various output options in STATISTICA let you work the way you want. View your results in individual windows, store output in a convenient workbook, or annotate results in a presentation-quality report. The Output Manager gives you complete control and remembers your preferences.
Microsoft Office Integration
STATISTICA is fully integrated with Microsoft Office products. This demonstration shows how to output STATISTICA results to Microsoft Word and open a Microsoft Excel document in STATISTICA.
Workbook Multi-item Display
STATISTICA multi-item display enables you to quickly view and edit all documents within a workbook. This video demonstrates how to view multi-item displays, print and save multi-item displays as PDF files, and customize STATISTICA documents within the multi-item display grid.
Reports In PDF Format
With STATISTICA, you can easily create reports in Acrobat (PDF) format for all STATISTICA document types. This powerful feature enables you to share documents with colleagues who have a PDF Reader such as Adobe Acrobat Reader. This video demonstrates how to save and print all STATISTICA document types as PDF files.
Categories of Graphs
In addition to the specialized statistical graphs that are available through all results dialog boxes in STATISTICA, there are two general types of graphs in STATISTICA. In this demonstration, these two graph types are explored: Graphs of Input Data, which visualize or summarize raw values from input data spreadsheets, and Graphs of Block Data, which visualize arbitrarily selected blocks of values, regardless of their source.
Auto-Update Graphs
The dynamic features to automatically update graphs facilitate the visual exploration of data. STATISTICA Graphs are dynamically linked to the data. Thus, when the data change, the graphs automatically update. This video demonstration explores how this functionality can be used for data correction and how to glean important patterns visually from the data, as well as how to create custom graph templates.
Create Random Sub-Samples
During exploratory analysis of very large data sets, it may be best to perform a variety of preliminary analyses using a subset of data. When all the data cases are equally important and a smaller but fully representative subset of the data is sufficient, it is beneficial to use STATISTICA’s options for creating new data files containing random subsets of data contained in the parent files. See how a random subset is created from a file containing 100,000 data cases.
Use Microscrolls
In this demonstration, see how microscrolls, a flexible interface with full mouse and keyboard support, aid interactive input of numerical values in STATISTICA. Microscrolls are available in every dialog with numerical input options, and greatly increase the speed and efficiency of the user interface.
ActiveX Controls
With STATISTICA, you can embed Active X controls into graphs. Active X controls provide the capability to create a custom user interface. This video demonstrates the use of a slider control and how it can be used to create a highly interactive graph.
Web-Browser
This demonstration shows how browser windows in STATISTICA are useful for viewing STATISTICA Enterprise Server reports, as well as viewing custom-made web interfaces that seamlessly interact with STATISTICA.

Predictive Modeling Solutions for Banking Industry

To understand customer needs, preferences, and behaviors, financial institutions such as banks, mortgage lenders, credit card companies, and investment advisors are turning to powerful data mining techniques. These techniques help companies in the financial sector to uncover hidden trends and explain the patterns that affect every aspect of their overall success.

Financial institutions have long collected detailed customer data – oftentimes in many disparate databases and in various formats. Only with the recent advances in database technology and data mining software have financial institutions acquired the necessary tools to manage their risks using all available information and to explore a wide range of scenarios. Now, business strategies in financial institutions are developed more intelligently than ever before.

Risk Management, Credit Scorecard

STATISTICA aids in the development, evaluation, and monitoring of scorecard models.

Fraud Detection

Banking fraud attempts have seen a drastic increase in recent years, making fraud detection more important than ever. Despite efforts on the part of financial institutions, hundreds of millions of dollars are lost to fraud every year.

STATISTICA helps banks and financial institutions to anticipate and quickly detect fraud and take immediate action to minimize costs. Through the use of sophisticated data mining tools, millions of transactions can be searched to spot patterns and detect fraudulent transactions.

Identify causes of risk; create sophisticated and automated models of risk.

• Segment and predict behavior of homogeneous (similar) groups of customers.
• Uncover hidden correlations between different indicators.
• Create models to price futures, options, and stocks.
• Optimize portfolio performance.

Tools and Techniques

STATISTICA Data Miner will empower your organization to provide better services and enhance the profitability of all aspects of your customer relationships. Predict customer behavior with STATISTICA Data Miner’s General Classifier and Regression tools to find rules for organizing customers into classes or groups. Find out who your most profitable, loyal customers are and who is more likely to default on loans or miss a payment. Apply state-of-the-art techniques to build and compare a wide variety of models.

Recognize patterns, segments, and clusters with STATISTICA Data Miner’s Cluster Analysis options and Generalized EM (Expectation Maximization) and K-means Clustering module. For example, clustering methods may help build a customer segmentation model from large data sets. Use the various methods for mapping customers and/or characteristics of customers and customer interactions to detect the general rules that apply to your exchanges with your customers.

STATISTICA Data Miner’s powerful Explorer offers tools including classification, hidden structure detection, and forecasting coupled with an Intelligent Wizard to make even the most complex problems and advanced analyses seem easier.

Uncover the most important variables from among thousands of potential measures with Data Miner’s Feature Selection and Variable Filtering module, or simplify the data variables and fields using the Principal Components Analysis or Partial Least Squares modules.

STATISTICA Data Miner also features Neural Networks, ARIMA, Exponentially Weighted Moving Average, Fourier Analysis, and many others. Learn from the data available to you, provide better services, and gain competitive advantages when you apply the absolute state-of-the-art in data mining techniques such as generalized linear and additive models, MARSplines, boosted trees, etc.

STATISTICA Multivariate Exploratory Techniques

STATISTICA Multivariate Exploratory Techniques offers a broad selection of exploratory techniques, from cluster analysis to advanced classification trees methods, with an endless array of interactive visualization tools for exploring relationships and patterns, plus complete built-in Visual Basic scripting.

• Cluster Analysis Techniques
• Factor Analysis
• Principal Components & Classification Analysis
• Canonical Correlation Analysis
• Reliability/Item Analysis
• Classification Trees
• Correspondence Analysis
• Multidimensional Scaling
• Discriminant Analysis
• General Discriminant Analysis Models (GDA)

Details

Cluster Analysis

This module includes a comprehensive implementation of clustering methods (k-means, hierarchical clustering, two-way joining). The program can process data from either raw data files or matrices of distance measures. The user can cluster cases, variables, or both based on a wide variety of distance measures (including Euclidean, squared Euclidean, City-block (Manhattan), Chebychev, Power distances, Percent disagreement, and 1-r) and amalgamation/linkage rules (including single, complete, weighted and unweighted group average or centroid, Ward’s method, and others). Matrices of distances can be saved for further analysis with other modules of the STATISTICA system. In k-means clustering, the user has full control over the initial cluster centers. Extremely large analysis designs can be processed; for example, hierarchical (tree) joining can analyze matrices with over 1,000 variables, or with over 1 million distances. In addition to the standard cluster analysis output, a comprehensive set of descriptive statistics and extended diagnostics (e.g., the complete amalgamation schedule with cohesion levels in hierarchical clustering, the ANOVA table in k-means clustering) is available. Cluster membership data can be appended to the current data file for further processing. Graphics options in the Cluster Analysis module include customizable tree diagrams, discrete contour-style two-way joining matrix plots, plots of amalgamation schedules, plots of means in k-means clustering, and many others.
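For readers who want to see the core of the k-means procedure itself, here is a deliberately minimal one-dimensional sketch in plain Python (illustrative only; it omits the initialization control, diagnostics, and large-design handling described above):

```python
# Sketch: bare-bones k-means with alternating assignment and update steps.
# One-dimensional data, k = 2; the values are invented for illustration.
def kmeans_1d(data, centers, iters=20):
    clusters = [[] for _ in centers]
    for _ in range(iters):
        # assignment step: attach each point to the nearest center
        clusters = [[] for _ in centers]
        for x in data:
            i = min(range(len(centers)), key=lambda j: abs(x - centers[j]))
            clusters[i].append(x)
        # update step: move each center to the mean of its cluster
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, clusters

data = [1.0, 1.2, 0.8, 9.0, 9.5, 8.7]
centers, clusters = kmeans_1d(data, centers=[0.0, 5.0])
print(centers)   # the centers settle near the two obvious groups, ~1.0 and ~9.07
```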

Factor Analysis

The Factor Analysis module contains a wide range of statistics and options, and provides a comprehensive implementation of factor (and hierarchical factor) analytic techniques with extended diagnostics and a wide variety of analytic and exploratory graphs. It will perform principal components, common, and hierarchical (oblique) factor analysis, and can handle extremely large analysis problems (e.g., with thousands of variables). Confirmatory factor analysis (as well as path analysis) can also be performed via the Structural Equation Modeling and Path Analysis (SEPATH) module found in STATISTICA Advanced Linear/Non-Linear Models.

Principal Components & Classification Analysis

STATISTICA also includes a designated program for principal components and classification analysis. The output includes eigenvalues (regular, cumulative, relative), factor loadings, factor scores (which can be appended to the input data file, reviewed graphically as icons, and interactively recoded), and a number of more technical statistics and diagnostics. Available rotations include Varimax, Equimax, Quartimax, Biquartimax (either normalized or raw), and Oblique rotations. The factorial space can be plotted and reviewed “slice by slice” in either 2D or 3D scatterplots with labeled variable-points; other integrated graphs include Scree plots, various scatterplots, bar and line graphs, and others. After a factor solution is determined, the user can recalculate (i.e., reconstruct) the correlation matrix from the respective number of factors to evaluate the fit of the factor model. Both raw data files and matrices of correlations can be used as input. Confirmatory factor analysis and other related analyses can be performed with the Structural Equation Modeling and Path Analysis (SEPATH) module available in STATISTICA Advanced Linear/Non-Linear Models, where a designated Confirmatory Factor Analysis Wizard will guide you step by step through the process of specifying the model.
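To make the eigenvalue output concrete, here is a small Python sketch (not STATISTICA's implementation) that extracts the principal components of a 2x2 correlation matrix in closed form; the correlation r = 0.8 is an invented example value:

```python
# Sketch: eigenvalues of a symmetric 2x2 matrix [[a, b], [b, d]] via the
# closed form  lambda = (a + d)/2 +- sqrt(((a - d)/2)^2 + b^2).
# For a 2-variable correlation matrix these are the PCA eigenvalues.
import math

def eigen_sym2(a, b, d):
    m = (a + d) / 2.0
    s = math.sqrt(((a - d) / 2.0) ** 2 + b * b)
    return m + s, m - s

r = 0.8                                # invented example correlation
lam1, lam2 = eigen_sym2(1.0, r, 1.0)
total = lam1 + lam2
print(lam1, lam2)                      # regular eigenvalues: 1.8 and 0.2
print(lam1 / total)                    # relative (proportion of variance): 0.9
```

The first component alone accounts for 90% of the variance here, which is why highly correlated variables compress so well into a few factors.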

Canonical Correlation Analysis

This module offers a comprehensive implementation of canonical analysis procedures; it can process raw data files or correlation matrices and it computes all of the standard canonical correlation statistics (including eigenvectors, eigenvalues, redundancy coefficients, canonical weights, loadings, extracted variances, significance tests for each root, etc.) and a number of extended diagnostics. The scores of canonical variates can be computed for each case, appended to the data file, and visualized via integrated icon plots. The Canonical Analysis module also includes a variety of integrated graphs (including plots of eigenvalues, canonical correlations, scatterplots of canonical variates, and many others). Note that confirmatory analyses of structural relationships between latent variables can also be performed via the SEPATH (Structural Equation Modeling and Path Analysis) module in STATISTICA Advanced Linear/Non-Linear Models. Advanced stepwise and best-subset selection of predictor variables for MANOVA/MANCOVA designs (with multiple dependent variables) is available in the General Regression Models (GRM) module in STATISTICA Advanced Linear/Non-Linear Models.

Reliability/Item Analysis

This module includes a comprehensive selection of procedures for the development and evaluation of surveys and questionnaires. As in all other modules of STATISTICA, extremely large designs can be analyzed. The user can calculate reliability statistics for all items in a scale, interactively select subsets, or obtain comparisons between subsets of items via the “split-half” (or split-part) method. In a single run, the user can evaluate the reliability of a sum-scale as well as subscales. When interactively deleting items, the new reliability is computed instantly without processing the data file again. The output includes correlation matrices and descriptive statistics for items, Cronbach alpha, the standardized alpha, the average inter-item correlation, the complete ANOVA table for the scale, the complete set of item-total statistics (including multiple item-total R‘s), the split-half reliability, and the correlation between the two halves corrected for attenuation. A selection of graphs (including various integrated scatterplots, histograms, line plots and other plots) and a set of interactive what-if procedures are provided to aid in the development of scales. For example, the user can calculate the expected reliability after adding a particular number of items to the scale, and can estimate the number of items that would have to be added to the scale in order to achieve a particular reliability. Also, the user can estimate the correlation corrected for attenuation between the current scale and another measure (given the reliability of the current scale).
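Two of the what-if calculations mentioned above have simple closed forms. The sketch below (plain Python with illustrative values, not STATISTICA's code) uses the Spearman-Brown prophecy formula to estimate the reliability of a lengthened scale and the number of items needed to reach a target reliability:

```python
# Sketch: reliability "what-if" calculations via the Spearman-Brown
# prophecy formula. All numeric inputs are invented examples.
import math

def spearman_brown(reliability, factor):
    # expected reliability when the scale is lengthened by `factor`
    return factor * reliability / (1 + (factor - 1) * reliability)

def items_needed(current_rel, target_rel, current_items):
    # solve Spearman-Brown for the lengthening factor, then scale the item count
    factor = (target_rel * (1 - current_rel)) / (current_rel * (1 - target_rel))
    return math.ceil(factor * current_items)

print(spearman_brown(0.70, 2))       # doubling a .70 scale -> ~0.82
print(items_needed(0.70, 0.90, 10))  # items needed to reach reliability .90
```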

Classification Trees

STATISTICA’s Classification Trees module provides a comprehensive implementation of the most recently developed algorithms for efficiently producing and testing the robustness of classification trees (a classification tree is a rule for predicting the class of an object from the values of its predictor variables). STATISTICA Data Miner offers additional advanced methods for tree classifications such as Boosted Trees, Random Forests, General Classification and Regression Tree Models (GTrees) and General CHAID (Chi-square Automatic Interaction Detection) models facilities. Classification trees can be produced using categorical predictor variables, ordered predictor variables, or both, and using univariate splits or linear combination splits.

Analysis options include performing exhaustive splits or discriminant-based splits; unbiased variable selection (as in QUEST); direct stopping rules (as in FACT) or bottom-up pruning (as in C&RT); pruning based on misclassification rates or on the deviance function; generalized Chi-square, G-square, or Gini-index goodness of fit measures. Priors and misclassification costs can be specified as equal, estimated from the data, or user-specified. The user can also specify the v value for v-fold cross-validation during tree building, v value for v-fold cross-validation for error estimation, size of the SE rule, minimum node size before pruning, seeds for random number generation, and alpha value for variable selection. Integrated graphics options are provided to explore the input and output data.
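As a small illustration of one of the goodness-of-fit measures listed above, the Gini index for a candidate univariate split can be computed as follows (plain Python; the class counts are invented):

```python
# Sketch: Gini impurity of a node and the weighted impurity after a split.
# A good split drives the weighted child impurity well below the parent's.
def gini(counts):
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

def split_gini(left_counts, right_counts):
    nl, nr = sum(left_counts), sum(right_counts)
    n = nl + nr
    # weighted impurity of the two child nodes
    return (nl / n) * gini(left_counts) + (nr / n) * gini(right_counts)

parent = gini([50, 50])                    # 0.5: a maximally impure 2-class node
after  = split_gini([40, 10], [10, 40])    # impurity after the candidate split
print(parent, after, parent - after)       # the decrease is the split's gain
```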

Correspondence Analysis

This module features a full implementation of simple and multiple correspondence analysis techniques, and can analyze even extremely large tables. The program will accept input data files with grouping (coding) variables that are to be used to compute the crosstabulation table, data files that contain frequencies (or some other measure of correspondence, association, similarity, confusion, etc.) and coding variables that identify (enumerate) the cells in the input table, or data files with frequencies (or other measure of correspondence) only (e.g., the user can directly type in and analyze a frequency table). For multiple correspondence analysis, the user can also directly specify a Burt table as input for the analysis. The program will compute various tables, including the table of row percentages, column percentages, total percentages, expected values, observed minus expected values, standardized deviates, and contributions to the Chi-square values. The Correspondence Analysis module will compute the generalized eigenvalues and eigenvectors, and report all standard diagnostics including the singular values, eigenvalues, and proportions of inertia for each dimension. The user can either manually choose the number of dimensions, or specify a cutoff value for the maximum cumulative percent of inertia. The program will compute the standard coordinate values for column and row points. The user has the choice of row-profile standardization, column-profile standardization, row and column profile standardization, or canonical standardization. For each dimension and row or column point, the program will compute the inertia, quality, and cosine-square values. In addition, the user can display (in spreadsheets) the matrices of the generalized singular vectors; like the values in all spreadsheets, these matrices can be accessed via STATISTICA Visual Basic, for example, in order to implement non-standard methods of computing the coordinates. 
The user can compute coordinate values and related statistics (quality and cosine-square values) for supplementary points (row or column), and compare the results with the regular row and column points. Supplementary points can also be specified for multiple correspondence analysis. In addition to the 3D histograms that can be computed for all tables, the user can produce a line plot for the eigenvalues, and 1D, 2D, and 3D plots for the row or column points. Row and column points can also be combined in a single graph, along with any supplementary points (each type of point will use a different color and point marker, so the different types of points can easily be identified in the plots). All points are labeled, and an option is available to truncate the names for the points to a user-specified number of characters.
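A few of the tables listed above are simple to compute by hand. The sketch below (plain Python with a made-up 2x2 frequency table) derives row percentages, expected values, and contributions to the Chi-square statistic:

```python
# Sketch: row percentages, expected values, and Chi-square contributions
# for a small invented frequency table.
freq = [[20, 30],
        [10, 40]]

row_tot = [sum(r) for r in freq]
col_tot = [sum(c) for c in zip(*freq)]
n = sum(row_tot)

row_pct  = [[100.0 * v / rt for v in r] for r, rt in zip(freq, row_tot)]
expected = [[rt * ct / n for ct in col_tot] for rt in row_tot]
chi2_contrib = [[(freq[i][j] - expected[i][j]) ** 2 / expected[i][j]
                 for j in range(2)] for i in range(2)]

print(row_pct[0])                          # [40.0, 60.0]
print(sum(sum(r) for r in chi2_contrib))   # the table's Chi-square statistic
```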

Multidimensional Scaling

The Multidimensional Scaling module includes a full implementation of (nonmetric) multidimensional scaling. Matrices of similarities, dissimilarities, or correlations between variables (i.e., “objects” or cases) can be analyzed. The starting configuration can be computed by the program (via principal components analysis) or specified by the user. The program employs an iterative procedure to minimize the stress value and the coefficient of alienation. The user can monitor the iterations and inspect the changes in these values. The final configurations can be reviewed via spreadsheets, and via 2D and 3D scatterplots of the dimensional space with labeled item-points. The output includes the values for the raw stress (raw F), Kruskal stress coefficient S, and the coefficient of alienation. The goodness of fit can be evaluated via Shepard diagrams (with d-hats and d-stars). Like all other results in STATISTICA, the final configuration can be saved to a data file.
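The stress value that the iterative procedure minimizes can be illustrated with Kruskal's stress-1 formula (conventions differ across programs, e.g., in how disparities are normalized; this plain-Python sketch uses invented dissimilarities):

```python
# Sketch: Kruskal's stress-1 for observed dissimilarities d and the
# corresponding distances dhat in the fitted configuration.
# A perfect fit gives stress 0; values are invented for illustration.
import math

def stress1(d, dhat):
    num = sum((a - b) ** 2 for a, b in zip(d, dhat))
    den = sum(b ** 2 for b in dhat)
    return math.sqrt(num / den)

print(stress1([1.0, 2.0, 3.0], [1.1, 1.9, 3.2]))  # small stress: a close fit
```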

Discriminant Analysis

The Discriminant Analysis module is a full implementation of multiple stepwise discriminant function analysis. STATISTICA also includes the General Discriminant Analysis Models module (below) for fitting ANOVA/ANCOVA-like designs to categorical dependent variables, and to perform various advanced types of analyses (e.g., best subset selection of predictors, profiling of posterior probabilities, etc.). The Discriminant Analysis program will perform forward or backward stepwise analyses, or enter user-specified blocks of variables into the model.

In addition to the numerous graphics and diagnostics describing the discriminant functions, the program also provides a wide range of options and statistics for the classification of old or new cases (for validation of the model). The output includes the respective Wilks’ lambdas, partial lambdas, F to enter (or remove), the p levels, the tolerance values, and the R-square. The program will perform a full canonical analysis and report the raw and cumulative eigenvalues for all roots, and their p levels, the raw and standardized discriminant (canonical) function coefficients, the structure coefficient matrix (of factor loadings), the means for the discriminant functions, and the discriminant scores for each case (which can also be automatically appended to the data file). Integrated graphs include histograms of the canonical scores within each group (and all groups combined), special scatterplots for pairs of canonical variables (where group membership of individual cases is visibly marked), a comprehensive selection of categorized (multiple) graphs allowing the user to explore the distribution and relations between dependent variables across the groups (including multiple box-and-whisker plots, histograms, scatterplots, and probability plots), and many others. The Discriminant Analysis module will also compute the standard classification functions for each group. The classification of cases can be reviewed in terms of Mahalanobis distances, posterior probabilities, or actual classifications, and the scores for individual cases can be visualized via exploratory icon plots and other multidimensional graphs integrated directly with the results spreadsheets. All of these values can be automatically appended to the current data file for further analyses. The summary classification matrix of the number and percent of correctly classified cases can also be displayed. 
The user has several options to specify the a priori classification probabilities and can specify selection conditions to include or exclude selected cases from the classification (e.g., to validate the classification functions in a new sample).

General Discriminant Analysis Models (GDA)

The STATISTICA General Discriminant Analysis (GDA) module is an application and extension of the General Linear Model to classification problems. Like the Discriminant Analysis module, GDA allows you to perform standard and stepwise discriminant analyses. GDA implements the discriminant analysis problem as a special case of the general linear model, and thereby offers extremely useful analytic techniques that are innovative, efficient, and extremely powerful.

Computational approach and unique applications. As in traditional discriminant analysis, GDA allows you to specify a categorical dependent variable. For the analysis, the group membership (with regard to the dependent variable) is then coded into indicator variables, and all methods of GRM can be applied. In the results dialogs, the extensive selection of residual statistics of GRM and GLM are available in GDA as well; for example, you can review all the regression-like residuals and predicted values for each group (each coded dependent indicator variable), and choose from the large number of residual plots. In addition, all specialized prediction and classification statistics are computed that are commonly reviewed in a discriminant analysis; but those statistics can be reviewed in innovative ways because of STATISTICA’s unique approach. For example, you can perform “desirability profiling” by combining the posterior prediction probabilities for the groups into a desirability score, and then let the program find the values or combination of categorical predictor settings that will optimize that score. Thus, GDA provides powerful and efficient tools for data mining as well as applied research; for example, you could use the DOE (Design of Experiments) methods to generate an experimental design for quality improvement, apply this design to categorical outcome data (e.g., distinct classifications of an outcome as “superior,” “acceptable,” or “failed”), and then model the posterior prediction probabilities of those outcomes using the variables of your experimental design.

Standard discriminant analysis results. STATISTICA GDA will compute all standard results for discriminant analysis, including discriminant function coefficients, canonical analysis results (standardized and raw coefficients, step-down tests of canonical roots, etc.), classification statistics (including Mahalanobis distances, posterior probabilities, actual classification of cases in the analysis sample and validation sample, misclassification matrix, etc.), and so on.

Unique features of GDA, currently only available in STATISTICA. In addition, STATISTICA GDA includes numerous unique features and results:

Specifying predictor variables and effects; model building:

1. Support for continuous and categorical predictors; instead of allowing only continuous predictors (a common limitation of traditional discriminant function analysis programs), GDA allows the user to specify simple and complex ANOVA- and ANCOVA-like designs, e.g., mixtures of continuous and categorical predictors, polynomial (response surface) designs, factorial designs, nested designs, etc.

2. Multiple-degree-of-freedom effects in stepwise selection; the terms that make up the predictor set (consisting not only of single-degree-of-freedom continuous predictors, but also multiple-degree-of-freedom effects) can be used in stepwise discriminant function analyses; multiple-degree-of-freedom effects are always entered or removed as blocks.

3. Best-subset selection of predictor effects; single- and multiple-degree-of-freedom effects can be specified for best-subset discriminant analysis; the program will select the effects (up to a user-specified number of effects) that produce the best discrimination between groups.

4. Selection of predictor effects based on misclassification rates; GDA allows the user to perform model building (selection of predictor effects) based not only on traditional criteria (e.g., p-to-enter/remove, Wilks’ lambda) but also on misclassification rates; in other words, the program will select those predictor effects that maximize the accuracy of classification, either for the cases from which the parameter estimates were computed, or for a cross-validation sample (to guard against over-fitting). These techniques elevate GDA to the level of a fast, neural-network-like data mining tool for classification that can be used as an alternative to similar techniques (tree classifiers, designated neural-network methods, etc.); GDA will tend to be faster than those techniques because it is still based on the more efficient General Linear Model.
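To make the misclassification-rate criterion concrete, here is a minimal, self-contained Python sketch of best-subset predictor selection scored by leave-one-out misclassification rate. A simple nearest-group-mean classifier stands in for the full discriminant-analysis machinery, and the toy data are invented for illustration; the point is only the selection loop, not GDA's actual algorithm.

```python
import itertools
import statistics

# Toy data: two groups, three candidate predictors (x1, x2, x3).
# x1 separates the groups cleanly; x2 overlaps a little; x3 is pure noise.
data = [
    # (group, x1, x2, x3)
    (0, 1.0, 2.1, 5.0), (0, 1.2, 1.9, 4.8), (0, 0.9, 2.3, 5.2), (0, 1.1, 2.0, 5.1),
    (1, 3.0, 2.6, 5.0), (1, 3.2, 2.4, 5.1), (1, 2.9, 2.8, 4.9), (1, 3.1, 2.5, 5.0),
]

def nearest_mean_classify(train, point, cols):
    """Assign `point` to the group whose mean (on columns `cols`) is closest."""
    groups = {g for g, *_ in train}
    def sq_dist(g):
        rows = [r for r in train if r[0] == g]
        means = [statistics.mean(r[c] for r in rows) for c in cols]
        return sum((point[c] - m) ** 2 for c, m in zip(cols, means))
    return min(groups, key=sq_dist)

def loo_error(cols):
    """Leave-one-out misclassification rate using only the predictors in `cols`."""
    errors = sum(
        nearest_mean_classify(data[:i] + data[i + 1:], row, cols) != row[0]
        for i, row in enumerate(data)
    )
    return errors / len(data)

# Best subset of at most 2 predictors, judged by cross-validated error
candidates = [c for k in (1, 2) for c in itertools.combinations((1, 2, 3), k)]
best = min(candidates, key=loo_error)
print(best, loo_error(best))
```

Scoring each subset on held-out cases (rather than on the training cases themselves) is what guards against over-fitting: the noise predictor x3 can only hurt a subset's cross-validated error, so it never gets selected.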

Results statistics; profiling:

1. Detailed results and diagnostic statistics and plots; in addition to the standard results statistics, GDA provides a large amount of auxiliary information to help the user judge the adequacy of the chosen discriminant analysis model (descriptive statistics and graphs, Mahalanobis distances, Cook distances, and leverages for predictors, etc.).

2. Profiling of expected classification; GDA includes an adaptation of the general GLM (GRM) response profiler; these options allow the user to quickly determine the values (or levels) of the predictor variables that maximize the posterior classification probability for a single group, or for a set of groups in the analysis; in a sense, the user can quickly determine the typical profiles of predictor values (or levels of categorical predictors) that identify a group (or set of groups) in the analysis.

A note of caution for models with categorical predictors, and other advanced techniques. The General Discriminant Analysis module provides functionality that makes this technique a general tool for classification and data mining. However, most, if not all, textbook treatments of discriminant function analysis are limited to simple and stepwise analyses with single-degree-of-freedom continuous predictors. Little experience has been reported in the literature regarding the robustness and effectiveness of these techniques when they are generalized in the manner provided in this very powerful module. The use of best-subset methods, in particular when used in conjunction with categorical predictors or when using misclassification rates in a cross-validation sample to choose the best subset of predictors, should be considered a heuristic search method rather than a statistical analysis technique.

System Requirements

STATISTICA Multivariate Exploratory Techniques is compatible with Windows XP, Windows Vista, and Windows 7.

Minimum System Requirements

• Operating System: Windows XP or above
• RAM: 256 MB
• Processor Speed: 500 MHz

Recommended System Requirements

• Operating System: Windows XP or above
• RAM: 1 GB
• Processor Speed: 2.0 GHz

Native 64-bit versions and highly optimized multiprocessor versions are available.

Is There Big Money in Big Data?

Many entrepreneurs foresee vast profits in mining data from online activity and mobile devices. One Wharton business school professor strongly disagrees.

Photo caption: Cut the nonsense: Peter Fader says a flood of consumer data collected from mobile devices may not help marketers as much as they think. (Credit: Wharton/Peter Olson)


Few ideas hold more sway among entrepreneurs and investors these days than “Big Data.” The idea is that we are now collecting so much information about people from their online behavior and, especially, through their mobile phones that we can make increasingly specific predictions about how they will behave and what they will buy.

But are those assumptions really true? One doubter is Peter Fader, codirector of the Wharton Customer Analytics Initiative at the University of Pennsylvania, where he is also a professor of marketing. Fader shared some of his concerns in an interview with reporter Lee Gomes.

TR: How would you describe the prevailing idea about Big Data inside the tech community?

Fader: “More is better.” If you can give me more data about a customer—if you can capture more aspects of their behavior, their connections with others, their interests, and so on—then I can pin down exactly what this person is all about. I can anticipate what they will buy, and when, and for how much, and through what channel.

So what exactly is wrong with that?

It reminds me a lot of what was going on 15 years ago with CRM (customer relationship management). Back then, the idea was “Wow, we can start collecting all these different transactions and data, and then, boy, think of all the predictions we will be able to make.” But ask anyone today what comes to mind when you say “CRM,” and you’ll hear “frustration,” “disaster,” “expensive,” and “out of control.” It turned out to be a great big IT wild-goose chase. And I’m afraid we’re heading down the same road with Big Data.

There seem to be a lot of businesses these days that promise to take a Twitter stream or a collection of Facebook comments and then make some prediction: about a stock price, about how a product will be received in the market.

That is all ridiculous. If you can get me a really granular view of data—for example, an individual’s tweets and then that same individual’s transactions, so I can see how they are interacting with each other—that’s a whole other story. But that isn’t what is happening. People are focusing on sexy social-media stuff and pushing it much further than they should be.

Some say the data fetish you’re describing is especially epidemic with the many startups connected with mobile computing. Do you think that’s true? And if so, wouldn’t it suggest that a year or two from now, there are going to be a lot of disappointed entrepreneurs and VCs?

There is a “data fetish” with every new trackable technology, from e-mail and Web browsing in the ’90s all the way through mobile communications and geolocation services today. Too many people think that mobile is a “whole new world,” offering stunning insights into behaviors that were inconceivable before. But many of the basic patterns are surprisingly consistent across these platforms. That doesn’t make them uninteresting or unimportant. But the basic methods we can use in the mobile world to understand and forecast these behaviors (and thus the key data needed to accomplish these tasks) are not nearly as radical as many people suspect.

But doesn’t mobile computing provide some forms of data that would be especially helpful, like your location—the fact that at a given moment, you might be shopping in a store? Information like that would seem to be quite valuable.

Absolutely. I’m not a total data Luddite. There’s no question that new technologies will provide all kinds of genuinely useful measures that were previously unattainable. The key question is: Just how much of that data do we really need? For instance, do we need a second-by-second log of the shopper’s location? Would it be truly helpful to integrate this series of observations with other behavioral data (e.g., which products the shopper examined)? Or would this just be nice to know? And how much of this data should we save after the trip is completed?

A true data scientist would have a decent sense of how to answer these questions, with an eye toward practical decision-making. But a Big Data zealot might say, “Save it all—you never know when it might come in handy for a future data-mining expedition.” That’s the distinction that separates “old school” and “new school” analysts.

Surely you’re not against machine learning, which has revolutionized fields like language translation, or new database tools like Hadoop?

I make sure my PhD students learn all these emerging technologies, because they are all very important for certain kinds of tasks. Machine learning is very good at classification—putting things in buckets. If I want to know which brand this person is going to buy next, or if this person is going to vote Republican or Democrat, nothing can touch machine learning, and it’s getting better all the time.

The problem is that there are many decisions that aren’t as easily “bucketized”; for instance, questions about “when” as opposed to “which.” Machine learning can break down pretty dramatically in those tasks. It’s important to have a much broader skill set than just machine learning and database management, but many “big data” people don’t know what they don’t know.

You appear to believe that some of the best work in data science was done long ago.

The golden age for predictive behavior was 40 or 50 years ago, when data were really sparse and companies had to squeeze as much insight as they could from them.

Consider Lester Wunderman, who coined the phrase “direct marketing” in the 1960s. He was doing true data science. He said, “Let’s write down everything we know about this customer: what they bought, what catalogue we sent them, what they paid for it.” It was very hard, because he didn’t have a Hadoop cluster to do it for him.

So what did he discover?

The legacy that he (and other old-school direct marketers) gave us is the still-powerful rubric of RFM: recency, frequency, monetary value.

The “F” and the “M” are obvious. You didn’t need any science for that. The “R” part is the most interesting, because it wasn’t obvious that recency, or the time of the last transaction, should even belong in the triumvirate of key measures, much less be first on the list. But it was discovered that customers who did stuff recently, even if they didn’t do a lot, were more valuable than customers who hadn’t been around for a while. That was a big surprise.

Some of those old models are really phenomenal, even today. Ask anyone in direct marketing about RFM, and they’ll say, “Tell me something I don’t know.” But ask anyone in e-commerce, and they probably won’t know what you’re talking about. Or they will use a lot of Big Data and end up rediscovering the RFM wheel—and that wheel might not run quite as smoothly as the original one.
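The RFM rubric Fader describes is simple enough to state as code. Below is a minimal sketch, with an invented transaction log, that computes recency, frequency, and monetary value per customer and then ranks customers by recency; it illustrates the idea only, not any particular vendor's scoring scheme.

```python
from datetime import date

# Hypothetical transaction log: (customer, purchase date, amount)
transactions = [
    ("alice", date(2012, 1, 5),  40.0),
    ("alice", date(2012, 6, 20), 25.0),
    ("bob",   date(2011, 3, 2),  90.0),
    ("bob",   date(2011, 4, 9),  60.0),
    ("bob",   date(2011, 5, 1),  80.0),
    ("carol", date(2012, 6, 28), 15.0),
]
today = date(2012, 7, 1)

def rfm(customer):
    rows = [(d, amt) for c, d, amt in transactions if c == customer]
    recency = (today - max(d for d, _ in rows)).days   # days since last purchase
    frequency = len(rows)                              # number of purchases
    monetary = sum(amt for _, amt in rows)             # total spend
    return recency, frequency, monetary

customers = sorted({c for c, _, _ in transactions})
scores = {c: rfm(c) for c in customers}

# Rank by recency (smallest = most recent), per the "R comes first" insight
ranked = sorted(customers, key=lambda c: scores[c][0])
print(ranked)
```

Note what the ranking does with these invented numbers: bob has the highest frequency and by far the highest total spend, yet he ranks last, because his most recent purchase is over a year old. That is exactly the counterintuitive lesson Fader attributes to the old-school direct marketers.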

Big Data and data scientists seem to have such a veneer of respectability.

In investing, you have “technical chartists.” They watch [stock] prices bouncing up and down, hitting what is called “resistance” at 30 or “support” at 20, for example. Chartists are looking at the data without developing fundamental explanations for why those movements are taking place—about the quality of a company’s management, for example.

Among financial academics, chartists tend to be regarded as quacks. But a lot of the Big Data people are exactly like them. They say, “We are just going to stare at the data and look for patterns, and then act on them when we find them.” In short, there is very little real science in what we call “data science,” and that’s a big problem.

Does any industry do it right?

Yes: insurance. Actuaries can say with great confidence what percent of people with your characteristics will live to be 80. But no actuary would ever try to predict when you are going to die. They know exactly where to draw the line.

Even with infinite knowledge of past behavior, we often won’t have enough information to make meaningful predictions about the future. In fact, the more data we have, the more false confidence we will have. Not only won’t our hit rate be perfect, it will be surprisingly low. The important part, as both scientists and businesspeople, is to understand what our limits are and to use the best possible science to fill in the gaps. All the data in the world will never achieve that goal for us.