# Monthly Archives: June 2012

## StatSoft Recognized for “Top Commercial Tool” in Poll

*“For the first time, the number of users of free/open source software exceeded the number of users of commercial software. The usage of Big Data software grew five-fold. R, Excel, and RapidMiner were the most popular tools, with StatSoft STATISTICA getting the top commercial tool spot.”* – KDnuggets.com

The 13th Annual KDnuggets™ Software Poll asked: **What Analytics, Data Mining, or Big Data software have you used in the past 12 months for a real project (not just evaluation)?**

This May 2012 poll attracted “a very large number of participants and used email verification” to ensure one vote per respondent. Once again, StatSoft’s *STATISTICA* received very high marks, earning “top commercial tool” in this poll.

Complete poll results and analysis can be found at KDnuggets.com.

KDnuggets.com is a data mining portal and newsletter publisher for the data mining community with more than 12,000 subscribers.

## STATISTICA / Custom Development

StatSoft offers statistical consulting services that range from a few hours of expert advice on your data analysis to a full-service analysis or data mining project. Data preparation and analysis tasks take time and expertise. When these resources are not available to complete an analysis project, StatSoft Statistical Consultants can help. Our consultants are highly knowledgeable in statistics and expert users of the *STATISTICA* software.

A StatSoft Statistical Consultant will review your project needs and provide a no-obligation estimate of the time and cost required. Email info@statsoft.co.za to discuss your project.

## Consulting Projects

Statistical Consulting encompasses a wide range of services including:

- Designing an experiment or study before data are collected
- Data cleaning and exploration
- Model building and evaluation
- Deployment of models generated from data mining

Statistical Consultants can either provide advice and guidance at any of these steps or perform the tasks themselves. Consulting projects are custom-tailored to your specific needs.

## Success Stories

The power industry has been highly successful in implementing *STATISTICA* in conjunction with consulting services. Visit StatSoft Power Solutions to learn more about how companies in the power industry have reduced costs and improved production with *STATISTICA MultiStream*.

## Popular Decision Tree: CHAID Analysis, Automatic Interaction Detection

## General CHAID Introductory Overview

The acronym CHAID stands for *Chi*-squared Automatic Interaction Detector. It is one of the oldest tree classification methods, originally proposed by Kass (1980). According to Ripley (1996), the CHAID algorithm is a descendant of THAID, developed by Morgan and Messenger (1973). CHAID builds non-binary trees (i.e., trees where more than two branches can attach to a single root or node), based on a relatively simple algorithm that is particularly well suited to the analysis of larger datasets. Also, because the CHAID algorithm often effectively yields many multi-way frequency tables (e.g., when classifying a categorical response variable with many categories, based on categorical predictors with many classes), it has been particularly popular in marketing research, in the context of market segmentation studies.

Both the CHAID and C&RT techniques construct trees in which each (non-terminal) node identifies a split condition, to yield optimum prediction (of continuous dependent or response variables) or classification (of categorical dependent or response variables). Hence, both types of algorithms can be applied to analyze regression-type or classification-type problems.

CHAID is a recursive partitioning method.

## Basic Tree-Building Algorithm: CHAID and Exhaustive CHAID

The acronym CHAID stands for *Chi*-squared Automatic Interaction Detector. This name derives from the basic algorithm that is used to construct (non-binary) trees, which for classification problems (when the dependent variable is categorical in nature) relies on the *Chi*-square test to determine the best next split at each step; for regression-type problems (continuous dependent variable) the program will actually compute F-tests. Specifically, the algorithm proceeds as follows:

**Preparing predictors.** The first step is to create categorical predictors out of any continuous predictors by dividing the respective continuous distributions into a number of categories with an approximately equal number of observations. For categorical predictors, the categories (classes) are “naturally” defined.
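This equal-frequency binning step can be sketched with pandas; the income data below are made up purely for illustration:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
income = pd.Series(rng.lognormal(10, 0.5, size=1000), name="income")

# Divide the continuous distribution into categories with an
# approximately equal number of observations (here, quartiles).
income_cat = pd.qcut(income, q=4, labels=["Q1", "Q2", "Q3", "Q4"])

counts = income_cat.value_counts()
# each of the 4 bins holds about 250 of the 1000 observations
```

With all-distinct continuous values, `pd.qcut` produces bins of (nearly) identical size, which is exactly the "approximately equal number of observations" the algorithm calls for.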

**Merging categories.** The next step is to cycle through the predictors to determine, for each predictor, the pair of (predictor) categories that is least significantly different with respect to the dependent variable; for classification problems (where the dependent variable is categorical as well), the algorithm computes a Pearson *Chi*-square test; for regression problems (where the dependent variable is continuous), *F* tests. If the respective test for a given pair of predictor categories is not statistically significant as defined by an alpha-to-merge value, the algorithm merges the respective predictor categories and repeats this step (i.e., finds the next pair of categories, which now may include previously merged categories). If the test for the respective pair of predictor categories is statistically significant (*p* less than the respective alpha-to-merge value), then (optionally) the algorithm computes a Bonferroni-adjusted *p*-value for the set of categories of the respective predictor.
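The merge test for one candidate pair of categories can be illustrated with SciPy; the 2 × 3 frequency table below is invented for the example:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Rows: a pair of predictor categories; columns: classes of the
# categorical dependent variable (made-up counts).
pair_table = np.array([[30, 25, 20],
                       [28, 27, 19]])

chi2, p, dof, _ = chi2_contingency(pair_table, correction=False)

alpha_to_merge = 0.05
if p > alpha_to_merge:
    # Not significantly different: merge the two predictor categories
    # by summing their rows, then repeat the search for the next pair.
    merged_row = pair_table.sum(axis=0)
```

Here the two rows have nearly identical class distributions, so the test is non-significant and the categories would be merged.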

**Selecting the split variable.** The next step is to select the split predictor: the variable with the smallest adjusted *p*-value, i.e., the predictor variable that will yield the most significant split. If the smallest (Bonferroni-) adjusted *p*-value for any predictor is greater than some alpha-to-split value, then no further splits are performed, and the respective node is a terminal node.
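The selection step then reduces to taking the minimum over the per-predictor adjusted *p*-values; the predictor names and values below are hypothetical:

```python
# Hypothetical Bonferroni-adjusted p-values, one per candidate predictor.
adjusted_p = {"income": 0.001, "region": 0.040, "age": 0.200}
alpha_to_split = 0.05

best_var, best_p = min(adjusted_p.items(), key=lambda kv: kv[1])

# If even the best p-value exceeds alpha-to-split, the node is terminal.
split_here = best_p <= alpha_to_split
```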

Continue this process until no further splits can be performed (given the alpha-to-merge and alpha-to-split values).

**CHAID and Exhaustive CHAID Algorithms.** A modification to the basic CHAID algorithm, called Exhaustive CHAID, performs a more thorough merging and testing of predictor variables, and hence requires more computing time. Specifically, the merging of categories continues (without reference to any alpha-to-merge value) until only two categories remain for each predictor. The algorithm then proceeds as described above in the *Selecting the split variable* step, and selects among the predictors the one that yields the most significant split. For large datasets, and with many continuous predictor variables, this modification of the simpler CHAID algorithm may require significant computing time.

## General Computation Issues of CHAID

**Reviewing large trees: Unique analysis management tools.** A general issue that arises when applying tree classification or regression methods is that the final trees can become very large. In practice, when the input data are complex and, for example, contain many different categories for classification problems, and many possible predictors for performing the classification, then the resulting trees can become very large. This is not so much a computational problem as it is a problem of presenting the trees in a manner that is easily accessible to the data analyst, or for presentation to the “consumers” of the research.

**Analyzing ANCOVA-like designs.** The classic CHAID algorithms can accommodate both continuous and categorical predictors. However, in practice, it is not uncommon to combine such variables into analysis of variance/covariance (ANCOVA)-like predictor designs with main effects or interaction effects for categorical and continuous predictors. This method of analyzing coded ANCOVA-like designs is relatively new. However, it is easy to see how the use of coded predictor designs expands these powerful classification and regression techniques to the analysis of data from experimental designs.

## CHAID, C&RT, and QUEST

For classification-type problems (categorical dependent variable), all three algorithms can be used to build a tree for prediction. QUEST is generally faster than the other two algorithms; however, for very large datasets its memory requirements are usually larger, so using the QUEST algorithm for classification with very large input data sets may be impractical.

For regression-type problems (continuous dependent variable), the QUEST algorithm is not applicable, so only CHAID and C&RT can be used. CHAID builds non-binary trees that tend to be “wider”. This has made the CHAID method particularly popular in market research applications: CHAID often yields many terminal nodes connected to a single branch, which can be conveniently summarized in a simple two-way table with multiple categories for each variable or dimension of the table. This type of display matches well the requirements of research on market segmentation; for example, CHAID may yield a split on a variable *Income*, dividing that variable into 4 categories and groups of individuals belonging to those categories that differ with respect to some important consumer-behavior-related variable (e.g., types of cars most likely to be purchased). C&RT will always yield binary trees, which sometimes cannot be summarized as efficiently for interpretation and/or presentation.

As far as predictive accuracy is concerned, it is difficult to derive general recommendations, and this issue is still the subject of active research. As a practical matter, it is best to apply different algorithms, perhaps compare them with user-defined, interactively derived trees, and decide on the most reasonable and best-performing model based on the prediction errors. For a discussion of various schemes for combining predictions from different models, see, for example, Witten and Frank (2000).

## Canonical Analysis

### How to Assess the Relationship Between Variables: Canonical Analysis

- General Purpose
- Computational Methods and Results
- Assumptions
- General Ideas
- Sum Scores
- Canonical Roots/Variates
- Number of Roots
- Extraction of Roots

## General Purpose of Canonical Analysis

There are several measures of correlation to express the relationship between two or more variables. For example, the standard Pearson product moment correlation coefficient (*r*) measures the extent to which two variables are related; there are various nonparametric measures of relationships that are based on the similarity of ranks in two variables; *Multiple Regression* allows one to assess the relationship between a dependent variable and a set of independent variables; Multiple Correspondence Analysis is useful for exploring the relationships between a set of categorical variables.

*Canonical Correlation* is an additional procedure for assessing the relationship between variables. Specifically, this analysis allows us to investigate the relationship between *two sets* of variables. For example, an educational researcher may want to compute the (simultaneous) relationship between three measures of scholastic ability with five measures of success in school. A sociologist may want to investigate the relationship between two predictors of social mobility based on interviews, with actual subsequent social mobility as measured by four different indicators. A medical researcher may want to study the relationship of various risk factors to the development of a group of symptoms. In all of these cases, the researcher is interested in the relationship between two sets of variables, and *Canonical Correlation* would be the appropriate method of analysis.

In the following topics, the major concepts and statistics in canonical correlation analysis are introduced. It is beneficial to be familiar with the correlation coefficient as described in Basic Statistics, and the basic ideas of multiple regression as described in the overview section of *Multiple Regression*.

## Computational Methods and Results

Following is a review of some of the computational issues involved in canonical correlation and the major results that are commonly reported.

**Eigenvalues.** When extracting the canonical roots, we compute the *eigenvalues*. These can be interpreted as the proportion of variance accounted for by the correlation between the respective canonical variates. Note that the proportion here is computed relative to the variance of the canonical variates, that is, of the weighted sum scores of the two sets of variables; the eigenvalues do *not* tell how much variability is explained in either set of variables. We compute as many eigenvalues as there are canonical roots, that is, as many as the minimum number of variables in either of the two sets.

**Successive eigenvalues will be of smaller and smaller size.** First, we compute the weights that maximize the correlation of the two sum scores. After this first root has been extracted, we will find the weights that produce the second largest correlation between sum scores, subject to the constraint that the next set of sum scores does not correlate with the previous one, and so on.

**Canonical correlations.** If the square root of the eigenvalues is taken, then the resulting numbers can be interpreted as correlation coefficients. Because the correlations pertain to the canonical variates, they are called *canonical correlations*. Like the eigenvalues, the correlations between successively extracted canonical variates are smaller and smaller. Therefore, as an overall index of the canonical correlation between two sets of variables, it is customary to report the largest correlation, that is, the one for the first root. However, the other canonical variates can also be correlated in a meaningful and interpretable manner (see below).
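Computationally, the squared canonical correlations can be obtained as the eigenvalues of Rxx⁻¹RxyRyy⁻¹Ryx, formed from the correlation blocks of the two sets. A minimal NumPy sketch on made-up data:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
X = rng.standard_normal((n, 3))  # e.g. 3 work-satisfaction items
Y = rng.standard_normal((n, 5))  # e.g. 5 general-satisfaction items

# Standardize both sets, then form the correlation blocks.
Xs = (X - X.mean(0)) / X.std(0, ddof=1)
Ys = (Y - Y.mean(0)) / Y.std(0, ddof=1)
Rxx = Xs.T @ Xs / (n - 1)
Ryy = Ys.T @ Ys / (n - 1)
Rxy = Xs.T @ Ys / (n - 1)

# Eigenvalues of Rxx^-1 Rxy Ryy^-1 Ryx are the squared canonical correlations.
M = np.linalg.solve(Rxx, Rxy) @ np.linalg.solve(Ryy, Rxy.T)
eigvals = np.sort(np.linalg.eigvals(M).real)[::-1]
canonical_corrs = np.sqrt(np.clip(eigvals, 0.0, 1.0))
# min(3, 5) = 3 roots, ordered from largest to smallest correlation
```

Note that the number of roots equals the smaller of the two set sizes, and the correlations come out in decreasing order, matching the text above.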

**Significance of Roots.** The significance test of the canonical correlations is straightforward in principle: the different canonical correlations are tested, one by one, beginning with the largest one, and only those roots that are statistically significant are retained for subsequent interpretation. In practice, the test proceeds slightly differently: first evaluate the significance of all roots combined, then of the roots remaining after removing the first root, then after removing the second root, and so on.

Some authors have criticized this sequential testing procedure for the significance of canonical roots (e.g., Harris, 1976). However, this procedure was “rehabilitated” in a subsequent Monte Carlo study by Mendoza, Markos, and Gonter (1978).

In short, the results of that study showed that this testing procedure will detect strong canonical correlations most of the time, even with samples of relatively small size (e.g., *n* = 50). Weaker canonical correlations (e.g., *R* = .3) require larger sample sizes (*n* > 200) to be detected at least 50% of the time. Note that canonical correlations of small magnitude are often of little practical value, as they account for very little actual variability in the data. This issue, as well as the sample size issue, will be discussed shortly.

**Canonical weights.** After determining the number of significant canonical roots, the question arises as to how to interpret each (significant) root. Remember that each root actually represents two weighted sums, one for each set of variables. One way to interpret the “meaning” of each canonical root is to look at the weights for each set. These weights are called the *canonical weights*.

In general, the larger the weight (i.e., the absolute value of the weight), the greater is the respective variable’s unique positive or negative contribution to the sum. To facilitate comparisons between weights, the canonical weights are usually reported for the standardized variables, that is, for the *z* transformed variables with a mean of *0* and a standard deviation of *1*.

If you are familiar with multiple regression, you may interpret the canonical weights in the same manner as you would interpret the beta weights in a multiple regression equation. In a sense, they represent the *partial correlations* of the variables with the respective canonical root. If you are familiar with factor analysis, you may interpret the canonical weights in the same manner as you would interpret the *factor score coefficients*. To summarize, the canonical weights allow the user to understand the “make-up” of each canonical root, that is, they let the user see how each variable in each set uniquely contributes to the respective weighted sum (canonical variate).

**Canonical Scores.** Canonical weights can also be used to compute actual values of the canonical variates; that is, we can simply use the weights to compute the respective sums. Again, remember that the canonical weights are customarily reported for the standardized (*z* transformed) variables.

**Factor structure.** Another way of interpreting the canonical roots is to look at the simple correlations between the canonical variates (or *factors*) and the variables in each set. These correlations are also called canonical factor *loadings*. The logic here is that variables that are highly correlated with a canonical variate have more in common with it. Therefore, we should weigh them more heavily when deriving a meaningful interpretation of the respective canonical variate. This method of interpreting canonical variates is identical to the manner in which factors are interpreted in factor analysis.
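Given a set of canonical weights, both the canonical scores and the factor loadings for one variate can be computed directly; the weights and data below are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100
X = rng.standard_normal((n, 3))
a = np.array([0.8, 0.5, -0.2])  # hypothetical canonical weights (standardized)

# Canonical scores: the weighted sum of the z-transformed variables.
Xs = (X - X.mean(0)) / X.std(0, ddof=1)
variate = Xs @ a

# Canonical factor loadings: simple correlations between each variable
# and the canonical variate.
loadings = np.array([np.corrcoef(Xs[:, j], variate)[0, 1] for j in range(3)])
```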

**Factor structure versus canonical weights.** Sometimes, the canonical weights for a variable are nearly zero, but the respective loading for the variable is very high. The opposite pattern of results may also occur. At first, such a finding may seem contradictory; however, remember that the canonical weights pertain to the unique contribution of each variable, while the canonical factor loadings represent simple overall correlations. For example, suppose we included in our satisfaction survey two items that measured basically the same thing, namely: (1) “Are you satisfied with your supervisors?” and (2) “Are you satisfied with your bosses?” Obviously, these items are very redundant. When the program computes the weights for the weighted sums (canonical variates) in each set so that they correlate maximally, it only “needs” to include one of the items to capture the essence of what they measure. Once a large weight is assigned to the first item, the contribution of the second item is redundant; consequently, it will receive a zero or negligibly small canonical weight. Nevertheless, if we then look at the simple correlations between the respective sum score with the two items (i.e., the factor *loadings*), those may be substantial for *both*. To reiterate, the canonical weights pertain to the *unique contributions* of the respective variables with a particular weighted sum or canonical variate; the canonical factor loadings pertain to the *overall correlation* of the respective variables with the canonical variate.

**Variance extracted.** As discussed earlier, the canonical correlation coefficient refers to the correlation between the weighted sums of the two sets of variables. It tells nothing about how much variability (variance) each canonical root explains in the *variables*. However, we can infer the proportion of variance extracted from each set of variables by a particular root by looking at the canonical factor loadings. Remember that those loadings represent correlations between the canonical variates and the variables in the respective set. If we square those correlations, the resulting numbers reflect the *proportion* of variance accounted for in each variable. For each root, we can take the average of those proportions across variables to get an indication of how much variability is explained, on the average, by the respective canonical variate in that set of variables. Put another way, we can compute in this manner the average proportion of *variance extracted* by each root.

**Redundancy.** The canonical correlations can be squared to compute the proportion of variance shared by the sum scores (canonical variates) in each set. If we multiply this proportion by the proportion of variance extracted, we arrive at a measure of *redundancy*, that is, of how redundant one set of variables is, given the other set of variables. In equation form, we can express the redundancy as:

Redundancy_{left} = [Σ(loadings_{left}^{2})/p] * R_{c}^{2}

Redundancy_{right} = [Σ(loadings_{right}^{2})/q] * R_{c}^{2}

In these equations, *p* denotes the number of variables in the first (*left*) set of variables, and *q* denotes the number of variables in the second (*right*) set; *R_{c}^{2}* is the respective squared canonical correlation, and each sum runs over the squared canonical factor loadings of the variables in the respective set.

Note that we can compute the redundancy of the first (*left*) set of variables given the second (*right*) set, and the redundancy of the second (*right*) set of variables, given the first (*left*) set. Because successively extracted canonical roots are uncorrelated, we could sum up the redundancies across all (or only the first significant) roots to arrive at a single index of redundancy (as proposed by Stewart and Love, 1968).
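The variance-extracted and redundancy formulas above can be checked numerically; the loadings and canonical correlation here are invented:

```python
import numpy as np

loadings_left = np.array([0.7, 0.6, 0.5])  # hypothetical loadings, p = 3 variables
Rc = 0.45                                   # hypothetical canonical correlation

# Average squared loading = proportion of variance extracted by this root.
variance_extracted = np.mean(loadings_left ** 2)

# Redundancy: variance extracted, weighted by the squared canonical correlation.
redundancy_left = variance_extracted * Rc ** 2
```

Here the root extracts about 37% of the variance of the left set, but the redundancy is only about 0.074, illustrating how a modest canonical correlation shrinks the variance actually shared across sets.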

**Practical significance.** The measure of redundancy is also useful for assessing the *practical* significance of canonical roots. With large sample sizes (see below), canonical correlations of magnitude *R = .30* may become statistically significant (see above). If we square this coefficient (*R-square = .09*) and use it in the redundancy formula shown above, it becomes clear that such canonical roots account for only very little variability in the variables. Of course, the final assessment of what does and does not constitute a finding of practical significance is subjective by nature. However, to maintain a realistic appraisal of how much actual variance (in the variables) is accounted for by a canonical root, it is important to always keep in mind the redundancy measure, that is, how much of the actual variability in one set of variables is explained by the other.

## Assumptions

The following discussion provides only a list of the most important assumptions of canonical correlation analysis, and the major threats to the reliability and validity of results.

**Distributions.** The tests of significance of the canonical correlations are based on the assumption that the distributions of the variables in the population (from which the sample was drawn) are multivariate normal. Little is known about the effects of violations of the multivariate normality assumption. However, with a sufficiently large sample size (see below) the results from canonical correlation analysis are usually quite robust.

**Sample sizes.** Stevens (1986) provides a very thorough discussion of the sample sizes that should be used in order to obtain reliable results. As mentioned earlier, if there are strong canonical correlations in the data (e.g., *R > .7*), then even relatively small samples (e.g., *n = 50*) will detect them most of the time. However, in order to arrive at reliable estimates of the canonical factor loadings (for interpretation), Stevens recommends that there should be at least 20 times as many cases as variables in the analysis, if one wants to interpret the most significant canonical root only. To arrive at reliable estimates for two canonical roots, Barcikowski and Stevens (1975) recommend, based on a Monte Carlo study, to include 40 to 60 times as many cases as variables.

**Outliers.** Outliers can greatly affect the magnitudes of correlation coefficients. Since canonical correlation analysis is based on (computed from) correlation coefficients, outliers can also seriously affect the canonical correlations. Of course, the larger the sample size, the smaller is the impact of one or two outliers. However, it is a good idea to examine various scatterplots to detect possible outliers.

See also Confidence Ellipse.

**Matrix Ill-Conditioning.** One assumption is that the variables in the two sets should not be completely redundant. For example, if we included the *same* variable twice in one of the sets, then it is not clear how to assign different weights to each of them. Computationally, such complete redundancies will “upset” the canonical correlation analysis. When there are perfect correlations in the correlation matrix, or if any of the multiple correlations between one variable and the others is perfect (*R = 1.0*), then the correlation matrix cannot be inverted, and the computations for the canonical analysis cannot be performed. Such correlation matrices are said to be *ill-conditioned*.

Once again, this assumption appears trivial on the surface; however, it often is “almost” violated when the analysis includes very many highly redundant measures, as is often the case when analyzing questionnaire responses.
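A quick way to spot such an ill-conditioned correlation matrix is to check its rank (or condition number) before running the analysis; the duplicated variable below is deliberate:

```python
import numpy as np

rng = np.random.default_rng(3)
x1 = rng.standard_normal(200)
x3 = rng.standard_normal(200)
X = np.column_stack([x1, x1, x3])  # the same variable included twice

R = np.corrcoef(X, rowvar=False)
rank = np.linalg.matrix_rank(R)
cond = np.linalg.cond(R)
# rank < number of variables (or an enormous condition number) means the
# correlation matrix cannot be inverted and the canonical analysis fails
```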

## General Ideas

Suppose we conduct a study in which we measure satisfaction at work with three questionnaire items, and satisfaction in various other domains with an additional seven items. The general question that we may want to answer is how satisfaction at work relates to the satisfaction in those other domains.

## Sum Scores

A first approach that we might take is simply to add up the responses to the work satisfaction items, and to correlate that sum with the responses to all other satisfaction items. If the correlation between the two sums is statistically significant, we could conclude that work satisfaction is related to satisfaction in other domains.

In a way this is a rather “crude” conclusion. We still know nothing about the particular domains of satisfaction that are related to work satisfaction. In fact, we could potentially have *lost* important information by simply adding up items. For example, suppose there were two items, one measuring satisfaction with one’s relationship with the spouse, the other measuring satisfaction with one’s financial situation. Adding the two together is, obviously, like adding “apples to oranges.” Doing so implies that a person who is dissatisfied with her finances but happy with her spouse is comparable overall to a person who is satisfied financially but not happy in the relationship with her spouse. Most likely, people’s psychological make-up is not that simple…

The problem then with simply correlating two sums is that one might lose important information in the process, and, in the worst case, actually “destroy” important relationships between variables by adding “apples to oranges.”

**Using a weighted sum.** It seems reasonable to correlate some kind of a weighted sum instead, so that the “structure” of the variables in the two sets is reflected in the weights. For example, if satisfaction with one’s spouse is only marginally related to work satisfaction, but financial satisfaction is strongly related to work satisfaction, then we could assign a smaller weight to the first item and a greater weight to the second item. We can express this general idea in the following equation:

a_{1}*y_{1} + a_{2}*y_{2} + … + a_{p}*y_{p} = b_{1}*x_{1} + b_{2}*x_{2} + … + b_{q}*x_{q}

If we have two sets of variables, the first one containing *p* variables and the second one containing *q* variables, then we would like to correlate the weighted sums on each side of the equation with each other.

**Determining the weights.** We have now formulated the general “model equation” for canonical correlation. The only problem that remains is how to determine the weights for the two sets of variables. It seems to make little sense to assign weights so that the two weighted sums do not correlate with each other. A reasonable approach to take is to impose the condition that the two weighted sums shall correlate maximally with each other.

## Canonical Roots/Variates

In the terminology of canonical correlation analysis, the weighted sums define a *canonical root* or *variate*. We can think of those canonical variates (weighted sums) as describing some underlying “latent” variables. For example, if for a set of diverse satisfaction items we were to obtain a weighted sum marked by large weights for all items having to do with work, we could conclude that the respective canonical variate measures satisfaction with work.

## Number of Roots

So far we have pretended as if there is only one set of weights (weighted sum) that can be extracted from the two sets of variables. However, suppose that we had among our work satisfaction items particular questions regarding satisfaction with pay, and questions pertaining to satisfaction with one’s social relationships with other employees. It is possible that the pay satisfaction items correlate with satisfaction with one’s finances, and that the social relationship satisfaction items correlate with the reported satisfaction with one’s spouse. If so, we should really derive two weighted sums to reflect this “complexity” in the structure of satisfaction.

In fact, the computations involved in canonical correlation analysis will lead to more than one set of weighted sums. To be precise, the number of roots extracted will be equal to the minimum number of variables in either set. For example, if we have three work satisfaction items and seven general satisfaction items, then three canonical roots will be extracted.

## Extraction of Roots

As mentioned before, we can extract roots so that the resulting correlation between the canonical variates is maximal. When extracting more than one root, each successive root will explain a *unique* additional proportion of variability in the two sets of variables. Therefore, successively extracted canonical roots will be uncorrelated with each other, and account for less and less variability.

## STATISTICA Product Catalog

*STATISTICA* Advanced

Includes the functionality of all of the following:

*STATISTICA Multivariate Exploratory Techniques* offers a broad selection of exploratory techniques, from cluster analysis to advanced classification tree methods, with a comprehensive array of interactive visualization tools for exploring relationships and patterns, plus a complete built-in Visual Basic scripting language.

- Cluster Analysis Techniques
- Factor Analysis and Principal Components
- Canonical Correlation Analysis
- Reliability/Item Analysis
- Classification Trees
- Correspondence Analysis
- Multidimensional Scaling
- Discriminant Analysis
- General Discriminant Analysis Models
- *STATISTICA* Visual Basic Language, and more.

*STATISTICA Advanced Linear/Nonlinear Models* contains a wide array of the most advanced linear and nonlinear modeling tools on the market; supports continuous and categorical predictors, interactions, and hierarchical models; offers automatic model selection facilities; and includes variance components, time series, and many other methods. All analyses include extensive, interactive graphical support and a complete built-in Visual Basic scripting language.

- Distribution and Simulation
- Variance Components and Mixed Model ANOVA/ANCOVA
- Survival/Failure Time Analysis
- General Nonlinear Estimation (and Logit/Probit)
- Log-Linear Analysis
- Time Series Analysis, Forecasting
- Structural Equation Modeling/Path Analysis (*SEPATH*)
- General Linear Models (*GLM*)
- General Regression Models (*GRM*)
- Generalized Linear/Nonlinear Models (*GLZ*)
- Partial Least Squares (*PLS*)
- *STATISTICA* Visual Basic Language, and more.

*STATISTICA Power Analysis and Interval Estimation* is an extremely precise and user-friendly research tool for analyzing all aspects of statistical power and sample size calculation.

- Power Calculations
- Sample Size Calculations
- Interval Estimation
- Probability Distribution Calculators, and more.

*STATISTICA Automated Neural Networks*

*STATISTICA Automated Neural Networks* contains a comprehensive array of statistics, charting options, network architectures, and training algorithms, plus C and PMML (Predictive Model Markup Language) code generators. The C code generator is an add-on.

Fully integrated with the *STATISTICA* system.

- A selection of the most popular network architectures, including Multilayer Perceptrons, Radial Basis Function networks, Linear Networks, and Self-Organizing Feature Maps
- State-of-the-art training algorithms, including Conjugate Gradient Descent, BFGS, Kohonen training, and k-Means Center Assignment
- Forming ensembles of networks for better prediction performance
- Automatic Network Search, a tool for automating neural network architecture and complexity selection
- Best Network Retention, and more.
- Support for various types of statistical analysis and predictive model building, including regression, classification, time series regression, time series classification, and cluster analysis for dimensionality reduction and visualization
- Full support for deployment of multiple models

*STATISTICA Automated Neural Networks Code Generator*

*STATISTICA Automated Neural Networks Code Generator* can generate neural network code in both C and PMML (Predictive Model Markup Language). The Code Generator add-on enables *STATISTICA Automated Neural Networks* users to generate a C code file that can be compiled into a C program based on the output of a neural network analysis.

- The C code generator add-on requires *STATISTICA Automated Neural Networks*
- Generates a source code version of a neural network (as a C or C++ file) that can be compiled with any C or C++ compiler
- The C code file can then be integrated into external programs

*STATISTICA Base*

*STATISTICA Base* offers a comprehensive set of essential statistics in a user-friendly package with flexible output management and Web enablement features; it also includes all *STATISTICA* graphics tools and a comprehensive Visual Basic development environment. The program is shipped on CD-ROM.

- Descriptive Statistics, Breakdowns, and Exploratory Data Analysis
- Correlations
- Interactive Probability Calculator
- T-Tests (and other tests of group differences)
- Frequency Tables, Crosstabulation Tables, Stub-and-Banner Tables, Multiple Response Analysis
- Multiple Regression Methods
- Nonparametric Statistics
- Distribution Fitting
- Enhanced graphics technology
- Powerful query tools
- Flexible data management
- ANOVA [supports 4 between factors and 1 within (repeated measure) factor]
- *STATISTICA* Visual Basic Language, and more.

*STATISTICA Data Miner*

Includes the functionality of all of the following:

*STATISTICA Automated Neural Networks*

*STATISTICA Data Miner* contains the most comprehensive selection of data mining solutions on the market, with an icon-based, extremely easy-to-use interface. It features a selection of completely integrated, automated, ready-to-deploy “as is” (but also easily customizable) data mining solutions for a wide variety of business applications. The product is optionally offered with deployment and on-site training services. The data mining solutions are driven by powerful procedures from five modules, which can also be used interactively and/or used to build, test, and deploy new solutions.

- General Slicer/Dicer Explorer
- General Classifier
- General Modeler/Multivariate Explorer
- General Forecaster
- General Neural Networks Explorer, and more.

Solution Packages to meet specific needs are available.

## *STATISTICA Scorecard*

*STATISTICA Scorecard*, a software solution for developing, evaluating, and monitoring scorecard models, includes the following capabilities and workflow:

- Data preparation
- Modelling
- Evaluation and calibration
- Monitoring

*STATISTICA Data Warehouse*

*STATISTICA Data Warehouse* is the ultimate high-performance, scalable system for intelligent management of unlimited amounts of data, distributed across locations worldwide.

*STATISTICA Document Management System*

*STATISTICA Document Management System* is a scalable solution for flexible, productivity-enhancing management of local or Web-based document repositories (FDA/ISO compliant).

*STATISTICA Enterprise*

*STATISTICA Enterprise* is an integrated multi-user software system designed for general-purpose data analysis and business intelligence applications in research, marketing, finance, and other industries. *STATISTICA Enterprise* provides an efficient interface to enterprise-wide data repositories and a means for collaborative work, as well as all the statistical functionality available in *STATISTICA Base*, *STATISTICA Advanced Models*, and *STATISTICA Exploratory Techniques* (optionally also *STATISTICA Automated Neural Networks* and *STATISTICA Power Analysis and Interval Estimation*).

- An efficient general interface to enterprise-wide repositories of data
- A means for collaborative work (groupware functionality)
- A reporting tool for formatted documents (PDF, HTML, MS Word) and analysis summaries of any of the tabular and graphical results produced by *STATISTICA*
- Compatible with (and linkable to) industry-standard enterprise-wide database management systems
- Custom configurations including any applications from the *STATISTICA* product line, and more.

*STATISTICA Enterprise / Quality Control (QC)*

*STATISTICA*’s comprehensive array of both routine and high-end statistical analyses, superior graphing technology, and unparalleled record of reviews give *STATISTICA Enterprise/QC* many advantages over competing products. A unique combination of features not found in any other SPC system makes *STATISTICA Enterprise/QC* the most comprehensive SPC system available.

- Real-time analytical tools
- A high performance database
- Groupware functionality for sharing queries, special applications, etc.
- Wizard-driven system administration tools
- A sophisticated reporting tool for web-based output
- One-click access to analyses and reports
- Built-in security system
- User-specific interfaces
- Open-ended alarm notification including cause/action prompts
- Interactive querying facilities
- Integration with external applications (Word, Excel, browsers)
- and much, much more…

*STATISTICA Enterprise Web Viewer*

*STATISTICA Enterprise Web Viewer* provides the ability to view analyses and reports that were generated within *STATISTICA Enterprise* or *STATISTICA Enterprise / QC*. This allows companies to protect their data and reports with the *STATISTICA Enterprise* security model.

*STATISTICA Extract, Transform, and Load (ETL)*

*STATISTICA Extract, Transform, and Load (ETL)* provides options to simplify and facilitate access to, aggregation of, and alignment of data from multiple databases, where some of the databases contain process data (using the optional PI Connector) while others contain “static” data (e.g., from Oracle or MS SQL Server). It provides for ad-hoc querying and aligning of data for subsequent analyses, such as ad-hoc charting of data describing a specific time interval.

*STATISTICA Live Score*

*STATISTICA Live Score* is *STATISTICA* Server software within the *STATISTICA* Data Analysis and Data Mining Platform. Data are aggregated and cleaned, and models are trained and validated, using the *STATISTICA* software. Once the models are validated, they are deployed to the *STATISTICA* Data Miner server. *STATISTICA Live Score* provides multi-threaded, efficient, and platform-independent scoring of data from line-of-business applications.

*STATISTICA Monitoring and Alerting Server (MAS)*

*STATISTICA Monitoring and Alerting Server (MAS)* is a system that enables users to automate the continual monitoring of hundreds or thousands of critical process and product parameters.

*STATISTICA MultiStream™ for Pharmaceutical Industries*

*STATISTICA MultiStream for Pharmaceutical Industries* is a solution package for identifying and implementing effective strategies for advanced multivariate process monitoring and control. *STATISTICA MultiStream* was designed for process industries in general, but is particularly well suited to help pharmaceutical manufacturers leverage the data collected in their existing specialized process databases for multivariate and predictive process control.

*STATISTICA MultiStream™ for Power Industries*

*STATISTICA MultiStream for Power Industries* is a solution package for identifying and implementing effective strategies for advanced multivariate process monitoring and control. *STATISTICA MultiStream* was designed for process industries in general, but is particularly well suited to help power generation facilities leverage the data collected in their existing specialized process databases for multivariate and predictive process control and for actionable advisory systems.

*STATISTICA Multivariate Statistical Process Control (MSPC)*

*STATISTICA Multivariate Statistical Process Control (MSPC)* is a complete solution for multivariate statistical process control, deployed within a scalable, secure analytics software platform.

*STATISTICA PI Connector*

*STATISTICA PI Connector* is an optional *STATISTICA* add-on component that allows for direct integration with data stored in the PI data historian. The *STATISTICA* PI Connector utilizes the PI user access control and security model, allows for interactive browsing of tags, and takes advantage of dedicated PI functionality for interpolation and snapshot data. *STATISTICA* integrated with the PI system is used for streamlined and automated analyses in applications such as Process Analytical Technology (PAT) in FDA-regulated industries, Advanced Process Control (APC) systems in the chemical and petrochemical industries, and advisory systems for process optimization and compliance in the energy utility industry.

*STATISTICA Process Optimization*

*STATISTICA Process Optimization*, an optional extension of *STATISTICA Data Miner*, is a powerful software solution designed to monitor processes and to identify and anticipate problems related to quality control and improvement with unmatched sensitivity and effectiveness. *STATISTICA Process Optimization* integrates all quality control charts, process capability analyses, experimental design procedures, and Six Sigma methods with a comprehensive library of cutting-edge techniques for exploratory and predictive data mining.

*STATISTICA Process Optimization* enables its users to:

- Predict QC problems with cutting-edge data mining methods
- Discover root causes of problem areas
- Monitor and improve ROI (Return On Investment)
- Generate suggestions for improvement
- Monitor processes in real time over the Web
- Create and deploy QC/SPC solutions over the Web
- Use multithreading and distributed processing to rapidly process extremely large streams of data.
- General Optimization

Solution Packages to meet specific needs are available.

*STATISTICA Quality Control (QC)*

Includes the functionality of all of the following:

*STATISTICA Quality Control Charts* offers versatile presentation-quality charts with a selection of automation options, customizable features, and user-interface shortcuts to simplify routine work.

- Quality Control Charts
- Interactive Quality Control Charts, including: real-time updating of charts, automatic alarm notification, shop floor mode, assigning causes and actions, analytic brushing, and dynamic project management
- Multivariate Quality Control Charts, including: Hotelling T-Square Charts, Multiple Stream (Group) Charts, Multivariate Exponentially Weighted Moving Average (MEWMA) Charts, Multivariate Cumulative Sum (MCUSUM) Charts, and Generalized Variance Charts
- *STATISTICA* Visual Basic Language, and more.

*STATISTICA Process Analysis* is a comprehensive package for process capability, Gage R&R, and other quality control/improvement applications.

- Process Capability Analysis
- Weibull Analysis
- Gage Repeatability & Reproducibility
- Sampling Plans
- Variance Components, and more.

*STATISTICA Design of Experiments* features the largest selection of DOE, visualization, and other analytic techniques, including powerful desirability profilers and extensive residual statistics.

- Fractional Factorial Designs
- Mixture Designs
- Latin Squares
- Search for Optimal 2^(k-p) Designs
- Residual Analysis and Transformations
- Optimization of Single or Multiple Response Variables
- Central Composite Designs
- Taguchi Designs
- Desirability Profiler
- Minimum Aberration and Maximum Unconfounding 2^(k-p) Fractional Factorial Designs with Blocks
- Constrained Surfaces
- D- and A-optimal Designs, and more.
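As an illustration of the two-level 2^(k-p) designs listed above (a sketch only, not StatSoft code): a full factorial enumerates every ±1 combination of the factors, and a half fraction generates the last factor from the product of the others (here with generator C = AB):

```python
from itertools import product
from math import prod

def full_factorial(k):
    """All 2**k runs of a two-level design, coded -1/+1."""
    return [list(run) for run in product([-1, 1], repeat=k)]

def half_fraction(k):
    """2**(k-1) runs: the k-th factor is generated as the product of the first k-1."""
    return [run + [prod(run)] for run in full_factorial(k - 1)]

print(len(full_factorial(3)))  # 8 runs for k = 3
print(half_fraction(3))        # 4 runs of a 2^(3-1) design with C = AB
```

The trade-off the fraction makes is confounding: in the half fraction, the main effect of C is aliased with the AB interaction, which is exactly what minimum-aberration design selection manages.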

*STATISTICA Power Analysis and Interval Estimation* is an extremely precise and user-friendly research tool for analyzing all aspects of statistical power and sample size calculation.

- Power Calculations
- Sample Size Calculations
- Interval Estimation
- Probability Distribution Calculators, and more.

*STATISTICA Sequence, Association, and Link Analysis (SAL)*

*STATISTICA Sequence, Association, and Link Analysis (SAL)* is designed to address the needs of clients in the retailing, banking, insurance, and related industries by implementing the fastest known highly scalable algorithm, with the ability to derive Association and Sequence rules in a single analysis. The program is a stand-alone module that can be used for both model building and deployment. All tools in *STATISTICA Data Miner* can be quickly and effortlessly leveraged to analyze and “drill into” results generated via *STATISTICA SAL*.

- Uses a Tree-Building technique to extract Association and Sequence rules from data
- Uses efficient and thread-safe local relational Database technology to store Association and Sequence models
- Handles multiple response, multiple dichotomy and continuous variables in one analysis
- Performs Sequence analysis while mining for Association rules in a single analysis
- Simultaneously extracts Association and Sequence rules for more than one dimension
- Can be used for Predictive Data Mining, given its ability to perform multidimensional Association and Sequence mining and to extract only the rules for specific items
- Performs Hierarchical Single-Linkage Cluster analysis, which can detect the clusters of items most likely to occur together; this has practical real-world applications, for example in retailing
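A toy illustration of the association-rule idea (support and confidence over item pairs; a hedged sketch, far simpler than SAL's scalable tree-based algorithm):

```python
from collections import Counter
from itertools import combinations

def pair_rules(transactions, min_support=0.5, min_confidence=0.6):
    """Extract A -> B rules over item pairs with support and confidence."""
    n = len(transactions)
    item_counts = Counter(i for t in transactions for i in set(t))
    pair_counts = Counter(p for t in transactions
                          for p in combinations(sorted(set(t)), 2))
    rules = []
    for (a, b), count in pair_counts.items():
        support = count / n                 # fraction of baskets with both items
        if support < min_support:
            continue
        for lhs, rhs in ((a, b), (b, a)):
            confidence = count / item_counts[lhs]  # P(rhs | lhs)
            if confidence >= min_confidence:
                rules.append((lhs, rhs, support, round(confidence, 2)))
    return rules

baskets = [{"bread", "milk"}, {"bread", "milk", "eggs"},
           {"bread", "eggs"}, {"milk", "eggs"}]
rules = pair_rules(baskets)
print(len(rules))  # 6 rules, each with support 0.5 and confidence 0.67
```

Sequence rules add an ordering constraint on top of this (A is followed by B), which is why mining both in one pass, as SAL does, is non-trivial.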

*STATISTICA Text Miner*

*STATISTICA Text Miner* is an optional extension of *STATISTICA Data Miner*. The program features a large selection of text retrieval, pre-processing, and analytic and interpretive mining procedures for unstructured text data (including Web pages), with numerous options for converting text into numeric information (for mapping, clustering, predictive data mining, etc.) and language-specific stemming algorithms. Because of *STATISTICA*’s flexible data import options, the methods available in *STATISTICA Text Miner* can also be useful for processing other unstructured input (e.g., image files imported as data matrices).
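The core idea of converting text into numeric information can be sketched as a bag-of-words document-term matrix (an illustrative sketch only; the product adds stemming, term weighting, and much more on top of this):

```python
from collections import Counter

def term_matrix(documents):
    """One row per document, one column per vocabulary term (raw counts)."""
    tokenized = [doc.lower().split() for doc in documents]
    vocab = sorted({word for doc in tokenized for word in doc})
    rows = [[Counter(doc)[term] for term in vocab] for doc in tokenized]
    return vocab, rows

vocab, rows = term_matrix(["the cat sat", "the cat and the dog"])
print(vocab)  # ['and', 'cat', 'dog', 'sat', 'the']
print(rows)   # [[0, 1, 0, 1, 1], [1, 1, 1, 0, 2]]
```

Once text is in this numeric form, the clustering and predictive modeling tools mentioned above can treat documents like any other cases-by-variables data.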

*STATISTICA Web Based Data Entry*

*STATISTICA Web Data Entry* enables companies to configure data entry scenarios that allow data entry via Web browsers, and the analysis of these data using all of the graphical data analysis, statistical analysis, and data mining capabilities of the *STATISTICA Enterprise* software platform.

*STATISTICA Web Data Entry* builds on the configuration objects in *STATISTICA Enterprise*:

- **Characteristics:** Numeric data to be collected for analysis (e.g., pH)
- **Labels:** Text or date data for traceability (e.g., Lot Number)
- **Data Entry Setups:** Groups of Characteristics and Labels configured with specific User/Group permissions to collect the appropriate data for particular scenarios

*STATISTICA Variance Estimation and Precision*

*STATISTICA Variance Estimation and Precision* is a comprehensive set of techniques for analyzing data from experiments that include both fixed and random effects, using REML (Restricted Maximum Likelihood) estimation. With Variance Estimation and Precision, users can obtain estimates of variance components and use them to make precision statements, while at the same time comparing fixed effects in the presence of multiple sources of variation.

Variance Estimation and Precision includes the following:

- Variability plots
- Multiple plot layouts to allow direct comparison of multiple dependent variables
- Expected mean squares and variance components with confidence intervals
- Flexible handling of multiple dependent variables: analyze several variables with the same or different designs at once
- Graph displays of variance components
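For intuition about what a variance component is, the classical ANOVA (method-of-moments) estimator for a balanced one-way random-effects design splits the total variation into between-group and within-group components; REML, as used by the product, generalizes this to unbalanced and mixed designs. A sketch (illustrative only):

```python
def variance_components(groups):
    """Method-of-moments variance components, balanced one-way random-effects design."""
    k = len(groups)          # number of random-effect levels (groups)
    n = len(groups[0])       # replicates per group (balanced design assumed)
    grand = sum(sum(g) for g in groups) / (k * n)
    means = [sum(g) / n for g in groups]
    msb = n * sum((m - grand) ** 2 for m in means) / (k - 1)  # between-group mean square
    msw = sum((x - m) ** 2
              for g, m in zip(groups, means) for x in g) / (k * (n - 1))
    sigma2_between = max((msb - msw) / n, 0.0)  # truncate negative estimates at zero
    return sigma2_between, msw                  # (between-group, within-group) variances

sb, sw = variance_components([[1, 2, 3], [4, 5, 6]])
print(sb, sw)
```

The between-group component here dominates the within-group one, which is the kind of comparison the variability plots above make visual.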

*WebSTATISTICA Knowledge Portal*

*WebSTATISTICA Knowledge Portal* is the ultimate knowledge-sharing tool. It incorporates the latest Internet technology and includes a powerful, flexible report generation tool and a secure system for information delivery.

*WebSTATISTICA Server Applications*

*WebSTATISTICA Server Applications* is the ultimate enterprise system offering full Web enablement, including the ability to run *STATISTICA* interactively or in batch from a Web browser on any computer (incl. Linux, UNIX), offload time-consuming tasks to the servers (using distributed processing), use a multi-tier Client-Server architecture, manage projects over the Web, and collaborate “across the hall or across continents.”

- Work collaboratively “across the hall” or “across continents”
- Run *STATISTICA* using any computer in the world (connected to the Internet)
- Offload time-consuming tasks to the servers
- Manage/administer projects over the Web
- Develop highly customized Web applications
- and much, much more…


## STATISTICA Solutions for Chemical and Petrochemical

Chemical and Petrochemical organizations are among the largest users of *STATISTICA* applications, benefiting from *STATISTICA* analytics both in Research & Development and Manufacturing.

## Research & Development

One contributing factor in a chemical/petrochemical company’s success is the ability of the R&D scientists to discover and develop a product formulation with useful properties.

**The STATISTICA platform results in hard and soft ROI by:**

- Empowering scientists with the analytic and exploratory tools to make more sound decisions and gain greater insights from the precious data that they collect
- Saving the scientists’ time by integrating analytics in their core processes
- Saving the statisticians’ time to focus on the delivery and packaging of effective analytic tools within the *STATISTICA* framework
- Increasing the level of collaboration across the R&D organization by sharing study results, findings, and reports

*STATISTICA* provides a broad base of integrated statistical and graphical tools including:

- Tools for basic research such as Exploratory Graphical Analysis, Descriptive Statistics, t-tests, Analysis of Variance, General Linear Models, and Nonlinear Curve Fitting.

- Tools for more advanced analyses, such as a variety of clustering, predictive modeling, classification, and machine learning approaches, including Principal Components Analysis.

The *STATISTICA* platform meets the needs of both scientists and statisticians in your R&D organization.

## Manufacturing

Chemical and Petrochemical organizations have deployed *STATISTICA* within their manufacturing processes in several ways:

- These organizations have arrived at a greater understanding of their process parameters and their relationship to product quality by applying *STATISTICA*’s multivariate statistical process control (SPC) techniques. *STATISTICA* integrates with their process information repositories and LIMS systems to retrieve the data required to perform these analyses.

- These organizations have also utilized the deployment capabilities of *STATISTICA*’s Data Mining algorithms to integrate advanced modeling techniques such as Neural Networks, Recursive Partitioning approaches (CHAID, C&RT, Boosted Trees), MARSplines, Independent Components Analysis, and Support Vector Machines. *STATISTICA* allows them to deploy a fully trained predictive model in Predictive Model Markup Language (PMML), C++, or Visual Basic for ongoing monitoring of a process. These models, once trained and evaluated on historical data, are deployed as “soft sensors” for the ongoing monitoring and control of process parameters.

## 10+ Great Metrics and Strategies for Fraud Detection

Emphasis here is on web log data. More than one rule must be triggered to fire an alarm, and a system such as hidden decision trees can be used to assign a specific weight to each rule.
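The weighted-rule idea can be sketched as follows (the rule names and weights below are hypothetical; in practice the weights would be learned, for example by a method such as hidden decision trees):

```python
def fraud_score(event, rules, threshold=1.0):
    """Alarm only when at least two rules fire AND the weighted score passes the threshold."""
    fired = [(name, weight) for name, (test, weight) in rules.items() if test(event)]
    score = sum(weight for _, weight in fired)
    return score, (len(fired) >= 2 and score >= threshold)

# Hypothetical rules keyed on fields of a web-log event
rules = {
    "blacklisted_ip": (lambda e: e.get("ip_blacklisted", False), 0.7),
    "short_session":  (lambda e: e.get("session_sec", 999) < 15, 0.4),
    "gibberish_ref":  (lambda e: e.get("referral_gibberish", False), 0.5),
}

score, alarm = fraud_score({"ip_blacklisted": True, "session_sec": 5}, rules)
print(score, alarm)
```

Requiring two or more rules before alarming, as the text suggests, keeps any single noisy rule from flooding the alert queue on its own.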

- Monte Carlo simulations to detect extreme events. Example: a large cluster of non-proxy IP addresses that have exactly 8 clicks per day, day after day. What is the chance of this happening *naturally*?
- IP address or referral domain belongs to a particular type of blacklist or whitelist. Classify the space of IP addresses into major clusters: static IP, anonymous proxy, corporate proxy (white-listed), edu proxy (high risk), highly recycled IP (higher risk), etc.
- Referral domain statistics: time to load with variance (based on 3 measurements), page size with variance (based on 3 measurements), and text strings found on the web page (either in HTML or JavaScript code). Create a list of suspicious terms (viagra, online casino, etc.) and a list of suspicious JavaScript tags or code, but use a white list of referral domains (e.g., top publishers) to eliminate false positives.
- Analyse domain name patterns. Example: a cluster of domain names with exactly identical fraud scores are all of the form xxx-and-yyy.com, and their web pages all have the same size (1 character).
- Association analysis: buckets of traffic with a huge proportion (>30%) of very short (<15 seconds) sessions that have two or more unknown referrals (that is, referrals other than Facebook, Google, Yahoo, or a top-500 domain). Aggregate all these mysterious referrals across these sessions – chances are that they are all part of the same botnet scheme (used, e.g., for click fraud).
- Mismatch in credit card fields: phone number in one country, email or IP address from a proxy domain owned by someone located in another country, physical address in yet another state, a name (e.g., Amy) and email address (e.g., joy431232@hotmail.com) that look very different, and a Google search on the email address reveals previous scams operated from the same account, or nothing at all.
- Referral web page or search keyword attached to a paid click contains gibberish or text strings made of letters that are very close on the keyboard, such as fgdfrffrft.
- Email address contains digits other than an area code, a year (e.g., 73), or a zip code (except if from someone in India or China)
- Time to 1st transaction after sign-up is very short
- Abnormal purchase pattern (Sunday at 2am, buy most expensive product on your e-store, from an IP outside US, on a B2B e-store targeted to US clients)
- Same small popular dollar amount (e.g. $9.99) across multiple merchants with same merchant category, with one or two transactions per cardholder
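The first metric above (is an “exactly 8 clicks per day, every day” cluster plausible?) can be sketched as a Monte Carlo simulation. The Poisson model for click traffic is an illustrative assumption, not part of the original list:

```python
import math
import random

def poisson_draw(lam, rng):
    """Knuth's algorithm: count uniforms until their product drops below exp(-lam)."""
    limit, k, p = math.exp(-lam), 0, 1.0
    while p > limit:
        k += 1
        p *= rng.random()
    return k - 1

def prob_all_exact(mean_clicks, target, n_ips, days, trials=20_000, seed=0):
    """Monte Carlo estimate: chance every IP posts exactly `target` clicks every day."""
    rng = random.Random(seed)
    hits = sum(
        all(poisson_draw(mean_clicks, rng) == target
            for _ in range(n_ips * days))
        for _ in range(trials)
    )
    return hits / trials

# Even a tiny cluster (3 IPs over 2 days) at exactly 8 clicks/day is vanishingly rare
print(prob_all_exact(mean_clicks=8, target=8, n_ips=3, days=2))
```

A single IP hitting exactly its mean on one day is unremarkable (probability ≈ 0.14 under this model); the alarm comes from the joint event across many IPs and many days, whose probability collapses toward zero.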