
## STATISTICA Multivariate Exploratory Techniques

*STATISTICA* Multivariate Exploratory Techniques offers a broad selection of exploratory techniques, from cluster analysis to advanced classification tree methods, together with a wide array of interactive visualization tools for exploring relationships and patterns and complete built-in Visual Basic scripting.

- Cluster Analysis Techniques
- Factor Analysis
- Principal Components & Classification Analysis
- Canonical Correlation Analysis
- Reliability/Item Analysis
- Classification Trees
- Correspondence Analysis
- Multidimensional Scaling
- Discriminant Analysis
- General Discriminant Analysis Models (*GDA*)

**Details**

## Cluster Analysis

This module includes a comprehensive implementation of clustering methods (*k*-means, hierarchical clustering, two-way joining). The program can process data from either raw data files or matrices of distance measures. The user can cluster cases, variables, or both based on a wide variety of distance measures (including Euclidean, squared Euclidean, City-block (Manhattan), Chebychev, Power distances, Percent disagreement, and *1-r*) and amalgamation/linkage rules (including single, complete, weighted and unweighted group average or centroid, Ward’s method, and others). Matrices of distances can be saved for further analysis with other modules of the *STATISTICA* system. In *k*-means clustering, the user has full control over the initial cluster centers. Extremely large analysis designs can be processed; for example, hierarchical (tree) joining can analyze matrices with over 1,000 variables, or with over 1 million distances. In addition to the standard cluster analysis output, a comprehensive set of descriptive statistics and extended diagnostics (e.g., the complete amalgamation schedule with cohesion levels in hierarchical clustering, the ANOVA table in *k*-means clustering) is available. Cluster membership data can be appended to the current data file for further processing. Graphics options in the *Cluster Analysis* module include customizable tree diagrams, discrete contour-style two-way joining matrix plots, plots of amalgamation schedules, plots of means in *k*-means clustering, and many others.
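As a rough illustration of the distance measures named above, the following Python sketch computes several of them for a pair of cases and builds a distance matrix that could serve as clustering input. The function names are hypothetical and are not part of *STATISTICA*; the formulas are the usual textbook definitions, which are assumed (not confirmed here) to match the module's.

```python
import math

def euclidean(x, y):
    """Straight-line distance between two cases."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def squared_euclidean(x, y):
    """Euclidean distance without the square root (weights large gaps more)."""
    return sum((a - b) ** 2 for a, b in zip(x, y))

def city_block(x, y):
    """City-block (Manhattan) distance: sum of absolute differences."""
    return sum(abs(a - b) for a, b in zip(x, y))

def chebychev(x, y):
    """Chebychev distance: largest absolute difference on any dimension."""
    return max(abs(a - b) for a, b in zip(x, y))

def percent_disagreement(x, y):
    """Share of dimensions on which two (categorical) cases differ."""
    return sum(a != b for a, b in zip(x, y)) / len(x)

def one_minus_r(x, y):
    """1 - Pearson r, so perfectly correlated profiles have distance 0."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return 1 - sxy / math.sqrt(sxx * syy)

def distance_matrix(rows, dist):
    """Symmetric matrix of pairwise distances between all rows."""
    return [[dist(r, s) for s in rows] for r in rows]

if __name__ == "__main__":
    cases = [(0, 0), (3, 4), (6, 8)]  # made-up data
    for row in distance_matrix(cases, euclidean):
        print(row)
```

Any of these functions can be passed to `distance_matrix`, which mirrors the idea of saving a matrix of distances and analyzing it separately.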

## Factor Analysis

The *Factor Analysis* module contains a wide range of statistics and options, and provides a comprehensive implementation of factor (and hierarchical factor) analytic techniques with extended diagnostics and a wide variety of analytic and exploratory graphs. It will perform principal components, common, and hierarchical (oblique) factor analysis, and can handle extremely large analysis problems (e.g., with thousands of variables). Confirmatory factor analysis (as well as path analysis) can also be performed via the *Structural Equation Modeling and Path Analysis (SEPATH)* module found in *STATISTICA Advanced Linear/Non-Linear Models*.

## Principal Components & Classification Analysis

*STATISTICA* also includes a designated program for principal components and classification analysis. The output includes eigenvalues (regular, cumulative, relative), factor loadings, factor scores (which can be appended to the input data file, reviewed graphically as icons, and interactively recoded), and a number of more technical statistics and diagnostics. Available rotations include Varimax, Equimax, Quartimax, Biquartimax (either normalized or raw), and Oblique rotations. The factorial space can be plotted and reviewed “slice by slice” in either 2D or 3D scatterplots with labeled variable-points; other integrated graphs include Scree plots, various scatterplots, bar and line graphs, and others. After a factor solution is determined, the user can recalculate (i.e., reconstruct) the correlation matrix from the respective number of factors to evaluate the fit of the factor model. Both raw data files and matrices of correlations can be used as input. Confirmatory factor analysis and other related analyses can be performed with the *Structural Equation Modeling and Path Analysis (SEPATH)* module available in *STATISTICA Advanced Linear/Non-Linear Models*, where a designated *Confirmatory Factor Analysis Wizard* will guide you step by step through the process of specifying the model.
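The eigenvalue summary and the reconstruction of the correlation matrix can be sketched in a few lines of Python. This is a minimal illustration with hypothetical function names, assuming that "relative" eigenvalues mean percent of total variance and that the reproduced correlations are the products of unrotated loadings summed over the retained factors; it is not the module's actual code.

```python
import math

def eigenvalue_table(eigenvalues):
    """Rows of (eigenvalue, % of total, cumulative eigenvalue, cumulative %)."""
    total = sum(eigenvalues)
    rows, running = [], 0.0
    for ev in eigenvalues:
        running += ev
        rows.append((ev, 100 * ev / total, running, 100 * running / total))
    return rows

def reproduced_correlations(loadings):
    """loadings[i][f] is the loading of variable i on retained factor f."""
    n = len(loadings)
    return [[sum(a * b for a, b in zip(loadings[i], loadings[j]))
             for j in range(n)] for i in range(n)]

def residual_correlations(observed, reproduced):
    """Observed minus reproduced correlations; small values suggest good fit."""
    return [[o - r for o, r in zip(orow, rrow)]
            for orow, rrow in zip(observed, reproduced)]

if __name__ == "__main__":
    # Two variables correlated r = 0.6: eigenvalues are 1 + r and 1 - r.
    r = 0.6
    observed = [[1.0, r], [r, 1.0]]
    one_factor = [[math.sqrt((1 + r) / 2)], [math.sqrt((1 + r) / 2)]]
    print(eigenvalue_table([1 + r, 1 - r]))
    print(residual_correlations(observed, reproduced_correlations(one_factor)))
```

Retaining both factors in this toy case reproduces the observed correlation exactly, while retaining only the first leaves a nonzero residual, which is the kind of fit check the text describes.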

## Canonical Correlation Analysis

This module offers a comprehensive implementation of canonical analysis procedures; it can process raw data files or correlation matrices and it computes all of the standard canonical correlation statistics (including eigenvectors, eigenvalues, redundancy coefficients, canonical weights, loadings, extracted variances, significance tests for each root, etc.) and a number of extended diagnostics. The scores of canonical variates can be computed for each case, appended to the data file, and visualized via integrated icon plots. The *Canonical Analysis* module also includes a variety of integrated graphs (including plots of eigenvalues, canonical correlations, scatterplots of canonical variates, and many others). Note that confirmatory analyses of structural relationships between latent variables can also be performed via the *SEPATH (Structural Equation Modeling and Path Analysis)* module in *STATISTICA Advanced Linear/Non-Linear Models*. Advanced stepwise and best-subset selection of predictor variables for MANOVA/MANCOVA designs (with multiple dependent variables) is available in the *General Regression Models (GRM)* module in *STATISTICA Advanced Linear/Non-Linear Models*.

## Reliability/Item Analysis

This module includes a comprehensive selection of procedures for the development and evaluation of surveys and questionnaires. As in all other modules of *STATISTICA*, extremely large designs can be analyzed. The user can calculate reliability statistics for all items in a scale, interactively select subsets, or obtain comparisons between subsets of items via the “split-half” (or split-part) method. In a single run, the user can evaluate the reliability of a sum-scale as well as subscales. When interactively deleting items, the new reliability is computed instantly without processing the data file again. The output includes correlation matrices and descriptive statistics for items, Cronbach’s *alpha*, the standardized *alpha*, the average inter-item correlation, the complete ANOVA table for the scale, the complete set of item-total statistics (including multiple item-total *R*s), the split-half reliability, and the correlation between the two halves corrected for attenuation. A selection of graphs (including various integrated scatterplots, histograms, line plots and other plots) and a set of interactive *what-if* procedures are provided to aid in the development of scales. For example, the user can calculate the expected reliability after adding a particular number of items to the scale, and can estimate the number of items that would have to be added to the scale in order to achieve a particular reliability. Also, the user can estimate the correlation corrected for attenuation between the current scale and another measure (given the reliability of the current scale).
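The reliability statistics and what-if calculations described above correspond to well-known formulas: Cronbach's alpha, the Spearman-Brown prophecy formula, and the correction for attenuation. The Python sketch below shows them under the assumption that the module uses these standard definitions; the function names and the toy data are made up for illustration.

```python
import math
from statistics import variance

def cronbach_alpha(items):
    """items: one list of scores per item, aligned by respondent."""
    k = len(items)
    totals = [sum(scores) for scores in zip(*items)]  # sum-scale per respondent
    item_var = sum(variance(scores) for scores in items)
    return k / (k - 1) * (1 - item_var / variance(totals))

def standardized_alpha(k, mean_inter_item_r):
    """Alpha computed from the average inter-item correlation."""
    return k * mean_inter_item_r / (1 + (k - 1) * mean_inter_item_r)

def spearman_brown(reliability, length_factor):
    """Expected reliability if the scale is made length_factor times as long."""
    return length_factor * reliability / (1 + (length_factor - 1) * reliability)

def length_factor_for(reliability, target):
    """How many times longer the scale must be to reach the target reliability."""
    return target * (1 - reliability) / (reliability * (1 - target))

def corrected_for_attenuation(r_xy, rel_x, rel_y=1.0):
    """Correlation between two measures after removing unreliability."""
    return r_xy / math.sqrt(rel_x * rel_y)

if __name__ == "__main__":
    items = [[2, 4, 3, 5, 4], [3, 5, 3, 4, 4], [2, 5, 4, 5, 3]]  # toy scores
    alpha = cronbach_alpha(items)
    print("alpha:", round(alpha, 3))
    print("alpha if doubled:", round(spearman_brown(alpha, 2), 3))
```

To turn `length_factor_for` into a number of items to add, multiply the current item count by the factor and subtract the current count.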

## Classification Trees

*STATISTICA’s Classification Trees* module provides a comprehensive implementation of the most recently developed algorithms for efficiently producing and testing the robustness of classification trees (a classification tree is a rule for predicting the class of an object from the values of its predictor variables). *STATISTICA Data Miner* offers additional advanced methods for tree classification, such as *Boosted Trees, Random Forests, General Classification and Regression Tree Models (GTrees)* and *General CHAID (Chi-square Automatic Interaction Detection)* models. Classification trees can be produced using categorical predictor variables, ordered predictor variables, or both, and using univariate splits or linear combination splits.

Analysis options include performing exhaustive splits or discriminant-based splits; unbiased variable selection (as in *QUEST*); direct stopping rules (as in *FACT*) or bottom-up pruning (as in *C&RT*); pruning based on misclassification rates or on the deviance function; generalized *Chi-square, G-square,* or *Gini*-index goodness of fit measures. Priors and misclassification costs can be specified as equal, estimated from the data, or user-specified. The user can also specify the *v* value for *v*-fold cross-validation during tree building, the *v* value for *v*-fold cross-validation for error estimation, the size of the SE rule, the minimum node size before pruning, seeds for random number generation, and the *alpha* value for variable selection. Integrated graphics options are provided to explore the input and output data.
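To make the idea of an exhaustive univariate split with a *Gini* measure concrete, here is a small Python sketch. It assumes the common C&RT-style definition (Gini impurity, with the split chosen to maximize the impurity reduction) and equal priors and costs; the function names are hypothetical, and pruning, cross-validation, and the SE rule are deliberately left out.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a node: 1 - sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def split_improvement(labels, goes_left):
    """Impurity of the parent minus the size-weighted impurity of the children."""
    left = [y for y, g in zip(labels, goes_left) if g]
    right = [y for y, g in zip(labels, goes_left) if not g]
    n = len(labels)
    return gini(labels) - len(left) / n * gini(left) - len(right) / n * gini(right)

def best_univariate_split(values, labels):
    """Try every midpoint between adjacent distinct values of one ordered
    predictor; return (improvement, threshold) for the best split, or
    (0.0, None) when no split improves on the parent node."""
    best = (0.0, None)
    cuts = sorted(set(values))
    for lo, hi in zip(cuts, cuts[1:]):
        threshold = (lo + hi) / 2
        gain = split_improvement(labels, [v <= threshold for v in values])
        if gain > best[0]:
            best = (gain, threshold)
    return best

if __name__ == "__main__":
    x = [1.0, 2.0, 3.0, 4.0]          # made-up ordered predictor
    y = ["a", "a", "b", "b"]          # made-up class labels
    print(best_univariate_split(x, y))
```

A full tree builder would apply this search to every predictor at every node and then prune the result; the sketch covers only the single-node search.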

## Correspondence Analysis

This module features a full implementation of simple and multiple correspondence analysis techniques, and can analyze even extremely large tables. The program will accept input data files with grouping (coding) variables that are to be used to compute the crosstabulation table, data files that contain frequencies (or some other measure of correspondence, association, similarity, confusion, etc.) and coding variables that identify (enumerate) the cells in the input table, or data files with frequencies (or other measure of correspondence) only (e.g., the user can directly type in and analyze a frequency table). For multiple correspondence analysis, the user can also directly specify a *Burt* table as input for the analysis.

The program will compute various tables, including the table of row percentages, column percentages, total percentages, expected values, observed minus expected values, standardized deviates, and contributions to the *Chi-square* values. The *Correspondence Analysis* module will compute the generalized eigenvalues and eigenvectors, and report all standard diagnostics including the singular values, eigenvalues, and proportions of inertia for each dimension. The user can either manually choose the number of dimensions, or specify a cutoff value for the maximum cumulative percent of inertia. The program will compute the standard coordinate values for column and row points. The user has the choice of row-profile standardization, column-profile standardization, row and column profile standardization, or canonical standardization. For each dimension and row or column point, the program will compute the inertia, quality, and cosine-square values. In addition, the user can display (in spreadsheets) the matrices of the generalized singular vectors; like the values in all spreadsheets, these matrices can be accessed via *STATISTICA* Visual Basic, for example, in order to implement non-standard methods of computing the coordinates.

The user can compute coordinate values and related statistics (quality and cosine-square values) for supplementary points (row or column), and compare the results with the regular row and column points. Supplementary points can also be specified for multiple correspondence analysis. In addition to the 3D histograms that can be computed for all tables, the user can produce a line plot for the eigenvalues, and 1D, 2D, and 3D plots for the row or column points. Row and column points can also be combined in a single graph, along with any supplementary points (each type of point will use a different color and point marker, so the different types of points can easily be identified in the plots). All points are labeled, and an option is available to truncate the names for the points to a user-specified number of characters.
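Several of the preliminary tables mentioned above follow directly from the frequency table. The Python sketch below computes row percentages, expected values, contributions to the Chi-square, and the total inertia (the Chi-square divided by the grand total, which in simple correspondence analysis equals the sum of the eigenvalues). Function names and the example table are hypothetical; the eigen-decomposition itself is not shown.

```python
def row_percentages(table):
    """Each cell as a percent of its row total (the row profiles)."""
    return [[100 * cell / sum(row) for cell in row] for row in table]

def expected_values(table):
    """Expected frequencies under independence of rows and columns."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]

def chi_square_contributions(table):
    """Per-cell (observed - expected)^2 / expected."""
    expected = expected_values(table)
    return [[(o - e) ** 2 / e for o, e in zip(orow, erow)]
            for orow, erow in zip(table, expected)]

def total_inertia(table):
    """Total Chi-square divided by the grand total of the table."""
    n = sum(sum(row) for row in table)
    chi_square = sum(sum(row) for row in chi_square_contributions(table))
    return chi_square / n

if __name__ == "__main__":
    freq = [[10, 20], [20, 10]]  # made-up 2 x 2 frequency table
    print(expected_values(freq))
    print(total_inertia(freq))
```

A table whose rows are proportional to each other has zero inertia, so there is nothing for the dimensions to explain.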

## Multidimensional Scaling

The *Multidimensional Scaling* module includes a full implementation of (nonmetric) multidimensional scaling. Matrices of similarities, dissimilarities, or correlations between variables (i.e., “objects” or cases) can be analyzed. The starting configuration can be computed by the program (via principal components analysis) or specified by the user. The program employs an iterative procedure to minimize the stress value and the coefficient of alienation. The user can monitor the iterations and inspect the changes in these values. The final configurations can be reviewed via spreadsheets, and via 2D and 3D scatterplots of the dimensional space with labeled item-points. The output includes the values for the raw stress (raw *Φ*), Kruskal stress coefficient *S*, and the coefficient of alienation. The goodness of fit can be evaluated via Shepard diagrams (with *d-hats* and *d-stars*). Like all other results in *STATISTICA*, the final configuration can be saved to a data file.
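The stress values reported for a configuration can be illustrated with a short Python sketch. It assumes the raw stress is the sum of squared differences between the reproduced distances and the monotonically transformed input data (the d-hats), and that the Kruskal coefficient is the usual Stress-1 normalization; the exact formulas used by the module are not given in this text, so treat the function names and definitions as assumptions.

```python
import math

def pairwise_distances(config):
    """Distances between all pairs of points, in (0,1), (0,2), ..., order."""
    n = len(config)
    return [math.dist(config[i], config[j])
            for i in range(n) for j in range(i + 1, n)]

def raw_stress(distances, disparities):
    """Sum of squared deviations between distances and d-hats."""
    return sum((d - dh) ** 2 for d, dh in zip(distances, disparities))

def kruskal_stress(distances, disparities):
    """Stress-1: raw stress scaled by the sum of squared distances."""
    scale = sum(d * d for d in distances)
    return math.sqrt(raw_stress(distances, disparities) / scale)

if __name__ == "__main__":
    config = [(0.0, 0.0), (3.0, 0.0), (0.0, 4.0)]  # made-up 2D configuration
    d = pairwise_distances(config)
    d_hat = [3.0, 4.0, 4.0]                         # made-up disparities
    print(raw_stress(d, d_hat), kruskal_stress(d, d_hat))
```

An iterative procedure would move the points in `config` to reduce these values; the sketch only evaluates a fixed configuration.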

## Discriminant Analysis

The *Discriminant Analysis* module is a full implementation of multiple stepwise discriminant function analysis. *STATISTICA* also includes the *General Discriminant Analysis Models* module (below) for fitting ANOVA/ANCOVA-like designs to categorical dependent variables, and for performing various advanced types of analyses (e.g., best subset selection of predictors, profiling of posterior probabilities, etc.). The *Discriminant Analysis* program will perform forward or backward stepwise analyses, or enter user-specified blocks of variables into the model.

In addition to the numerous graphics and diagnostics describing the discriminant functions, the program also provides a wide range of options and statistics for the classification of *old* or *new* cases (for validation of the model). The output includes the respective Wilks’ *lambdas*, partial *lambdas*, *F* to enter (or remove), the *p* levels, the tolerance values, and the *R*-square. The program will perform a full canonical analysis and report the raw and cumulative eigenvalues for all roots, and their *p* levels, the raw and standardized discriminant (canonical) function coefficients, the structure coefficient matrix (of factor loadings), the means for the discriminant functions, and the discriminant scores for each case (which can also be automatically appended to the data file).

Integrated graphs include histograms of the canonical scores within each group (and all groups combined), special scatterplots for pairs of canonical variables (where group membership of individual cases is visibly marked), a comprehensive selection of categorized (multiple) graphs allowing the user to explore the distribution and relations between dependent variables across the groups (including multiple box-and-whisker plots, histograms, scatterplots, and probability plots), and many others.

The *Discriminant Analysis* module will also compute the standard classification functions for each group. The classification of cases can be reviewed in terms of Mahalanobis distances, *posterior* probabilities, or actual classifications, and the scores for individual cases can be visualized via exploratory icon plots and other multidimensional graphs integrated directly with the results spreadsheets. All of these values can be automatically appended to the current data file for further analyses. The summary classification matrix of the number and percent of correctly classified cases can also be displayed. The user has several options to specify the *a priori* classification probabilities and can specify selection conditions to include or exclude selected cases from the classification (e.g., to validate the classification functions in a new sample).
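The link between Mahalanobis distances, *a priori* probabilities, and posterior probabilities can be sketched as follows. This Python example assumes the textbook linear discriminant setup (multivariate normal groups sharing one pooled covariance matrix), restricts itself to two predictors so the matrix inverse can be written out by hand, and uses hypothetical function names and made-up group means.

```python
import math

def invert_2x2(m):
    """Inverse of a 2 x 2 (pooled within-group covariance) matrix."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def mahalanobis_sq(x, mean, cov_inv):
    """Squared Mahalanobis distance of case x from a group centroid."""
    diff = [xi - mi for xi, mi in zip(x, mean)]
    k = len(diff)
    return sum(diff[i] * cov_inv[i][j] * diff[j]
               for i in range(k) for j in range(k))

def posterior_probabilities(x, means, cov_inv, priors):
    """Posterior for each group, proportional to prior * exp(-D^2 / 2)."""
    weights = {g: priors[g] * math.exp(-0.5 * mahalanobis_sq(x, means[g], cov_inv))
               for g in means}
    total = sum(weights.values())
    return {g: w / total for g, w in weights.items()}

def classify(x, means, cov_inv, priors):
    """Assign the case to the group with the highest posterior probability."""
    post = posterior_probabilities(x, means, cov_inv, priors)
    return max(post, key=post.get)

if __name__ == "__main__":
    cov_inv = invert_2x2([[1.0, 0.2], [0.2, 1.0]])      # made-up pooled covariance
    means = {"A": (0.0, 0.0), "B": (2.0, 1.0)}          # made-up centroids
    priors = {"A": 0.5, "B": 0.5}                       # equal a priori probabilities
    print(posterior_probabilities((0.5, 0.5), means, cov_inv, priors))
```

Changing `priors` shifts the posteriors without changing the distances, which is why the two are reported separately.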

## General Discriminant Analysis Models (GDA)

The *STATISTICA General Discriminant Analysis (GDA)* module is an application and extension of the *General Linear Model* to classification problems. Like the *Discriminant Analysis* module, *GDA* allows you to perform standard and stepwise discriminant analyses. Because *GDA* implements the discriminant analysis problem as a special case of the general linear model, it also offers analytic techniques that traditional discriminant analysis programs do not provide.

**Computational approach and unique applications.** As in traditional discriminant analysis, *GDA* allows you to specify a categorical dependent variable. For the analysis, the group membership (with regard to the dependent variable) is then coded into indicator variables, and all methods of *GRM* can be applied. In the results dialogs, the extensive selection of residual statistics of *GRM* and *GLM* is available in *GDA* as well; for example, you can review all the regression-like residuals and predicted values for each group (each coded dependent indicator variable), and choose from the large number of residual plots. In addition, all specialized prediction and classification statistics are computed that are commonly reviewed in a discriminant analysis; but those statistics can be reviewed in innovative ways because of *STATISTICA’s* unique approach. For example, you can perform “desirability profiling” by combining the posterior prediction probabilities for the groups into a desirability score, and then let the program find the values or combination of categorical predictor settings that will optimize that score. Thus, *GDA* provides powerful and efficient tools for data mining as well as applied research; for example, you could use the *DOE (Design of Experiments)* methods to generate an experimental design for quality improvement, apply this design to categorical outcome data (e.g., distinct classifications of an outcome as “superior,” “acceptable,” or “failed”), and then model the posterior prediction probabilities of those outcomes using the variables of your experimental design.
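Two of the ideas in this paragraph, indicator coding of group membership and desirability profiling, can be sketched in Python. The example assumes a simple weighted sum as the desirability score and a brute-force search over candidate settings; the function names, the weights, and the `posterior_fn` callback are hypothetical stand-ins, since the text does not describe how the module combines the probabilities or performs the search.

```python
from itertools import product

def indicator_coding(groups):
    """Code a categorical dependent variable as one 0/1 column per group."""
    levels = sorted(set(groups))
    coded = [[1 if g == level else 0 for level in levels] for g in groups]
    return levels, coded

def desirability(posteriors, weights):
    """Weighted sum of posterior probabilities (assumed scoring rule)."""
    return sum(weights[g] * p for g, p in posteriors.items())

def best_settings(posterior_fn, candidate_levels, weights):
    """Try every combination of predictor settings and keep the one whose
    predicted posteriors give the highest desirability score."""
    best = None
    for combo in product(*candidate_levels.values()):
        settings = dict(zip(candidate_levels, combo))
        score = desirability(posterior_fn(settings), weights)
        if best is None or score > best[0]:
            best = (score, settings)
    return best

if __name__ == "__main__":
    levels, coded = indicator_coding(["failed", "superior", "acceptable", "superior"])
    print(levels)
    print(coded)
```

In practice `posterior_fn` would be a fitted model that returns posterior probabilities for given predictor settings; here it is left to the caller.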

**Standard discriminant analysis results.** *STATISTICA GDA* will compute all standard results for discriminant analysis, including discriminant function coefficients, canonical analysis results (standardized and raw coefficients, step-down tests of canonical roots, etc.), classification statistics (including Mahalanobis distances, posterior probabilities, actual classification of cases in the analysis sample and validation sample, misclassification matrix, etc.), and so on.

**Unique features of GDA, currently only available in STATISTICA.** In addition, *STATISTICA GDA* includes numerous unique features and results:

*Specifying predictor variables and effects; model building:*

1. Support for continuous and categorical predictors; instead of allowing only continuous predictors in the analysis (the common limitation in traditional discriminant function analysis programs), *GDA* allows the user to specify simple and complex ANOVA and ANCOVA-like designs, e.g., mixtures of continuous and categorical predictors, polynomial (response surface) designs, factorial designs, nested designs, etc.

2. Multiple-degree of freedom effects in stepwise selection; the terms that make up the predictor set (consisting not only of single-degree of freedom continuous predictors, but also multiple-degree of freedom effects) can be used in stepwise discriminant function analyses; multiple-degree of freedom effects will always be entered/removed as blocks.

3. Best subset selection of predictor effects; single- and multiple-degree of freedom effects can be specified for best-subset discriminant analysis; the program will select the effects (up to a user-specified number of effects) that produce the best discrimination between groups.

4. Selection of predictor effects based on misclassification rates; *GDA* allows the user to perform model building (selection of predictor effects) not only based on traditional criteria (e.g., p-to-enter/remove; Wilks’ lambda), but also based on misclassification rates; in other words the program will select those predictor effects that maximize the accuracy of classification, either for those cases from which the parameter estimates were computed, or for a cross-validation sample (to guard against over-fitting); these techniques elevate *GDA* to the level of a fast neural-network-like data mining tool for classification, that can be used as an alternative to other similar techniques (tree-classifiers, designated neural-network methods, etc.; *GDA* will tend to be faster than those techniques because it is still based on the more efficient *General Linear Model*).
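Selecting predictors by misclassification rate in a cross-validation sample (item 4) can be illustrated with a deliberately simplified Python sketch. It swaps in a nearest-centroid classifier as a stand-in for the discriminant functions, handles only single continuous predictors rather than multiple-degree of freedom effects, and uses hypothetical function names and made-up data, so it shows the search strategy and not the module's actual computations.

```python
import math
from itertools import combinations

def centroids(rows, labels, cols):
    """Mean of the selected predictor columns within each group."""
    sums, counts = {}, {}
    for row, g in zip(rows, labels):
        acc = sums.setdefault(g, [0.0] * len(cols))
        for k, c in enumerate(cols):
            acc[k] += row[c]
        counts[g] = counts.get(g, 0) + 1
    return {g: [s / counts[g] for s in acc] for g, acc in sums.items()}

def misclassification_rate(train, train_y, test, test_y, cols):
    """Fit centroids on the analysis sample, score on the validation sample."""
    cents = centroids(train, train_y, cols)
    errors = 0
    for row, g in zip(test, test_y):
        point = [row[c] for c in cols]
        predicted = min(cents, key=lambda k: math.dist(point, cents[k]))
        errors += predicted != g
    return errors / len(test)

def best_subset(train, train_y, test, test_y, n_predictors, max_size):
    """Search all subsets up to max_size; keep the first with the lowest rate."""
    best = None
    for size in range(1, max_size + 1):
        for cols in combinations(range(n_predictors), size):
            rate = misclassification_rate(train, train_y, test, test_y, cols)
            if best is None or rate < best[0]:
                best = (rate, cols)
    return best

if __name__ == "__main__":
    train = [(0, 5), (1, 9), (10, 6), (11, 8)]   # made-up analysis sample
    train_y = ["a", "a", "b", "b"]
    test = [(0.5, 8), (10.5, 5)]                 # made-up validation sample
    test_y = ["a", "b"]
    print(best_subset(train, train_y, test, test_y, n_predictors=2, max_size=2))
```

Because the search keeps the first subset that attains the lowest rate and tries smaller subsets first, ties are resolved in favor of the simpler model.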

*Results statistics; profiling:*

1. Detailed results and diagnostic statistics and plots; in addition to the standard results statistics, *GDA* provides a large amount of auxiliary information to help the user judge the adequacy of the chosen discriminant analysis model (descriptive statistics and graphs, Mahalanobis distances, Cook distances, and leverages for predictors, etc.).

2. Profiling of expected classification; *GDA* includes an adaptation of the general *GLM (GRM)* response profiler; these options allow the user to quickly determine the values (or levels) of the predictor variables that maximize the posterior classification probability for a single group, or for a set of groups in the analyses; in a sense, the user can quickly determine the typical profiles of values of the predictors (or levels of categorical predictors) that identify a group (or set of groups) in the analysis.

*A note of caution for models with categorical predictors, and other advanced techniques.* The *General Discriminant Analysis* module provides functionality that makes this technique a general tool for classification and data mining. However, most, if not all, textbook treatments of discriminant function analysis are limited to simple and stepwise analyses with single degree of freedom continuous predictors. No “experience” (in the literature) exists regarding the robustness and effectiveness of these techniques when they are generalized in the manner provided in this module. The use of best-subset methods, in particular when used in conjunction with categorical predictors or when using the misclassification rates in a cross-validation sample for choosing the best subset of predictors, should be considered a heuristic search method rather than a statistical analysis technique.

**System Requirements**

*STATISTICA Multivariate Exploratory Techniques* is compatible with Windows XP, Windows Vista, and Windows 7.

## Minimum System Requirements

- Operating System: Windows XP or above
- RAM: 256 MB
- Processor Speed: 500 MHz

## Recommended System Requirements

- Operating System: Windows XP or above
- RAM: 1 GB
- Processor Speed: 2.0 GHz

Native 64-bit versions and highly optimized multiprocessor versions are available.