Monthly Archives: May 2012

StatSoft South Africa has upcoming Statistica Training! Click here to register for your course :)

Dear STATISTICA user,

STATISTICA v10 is a comprehensive, integrated data analysis, graphics, database management, & custom application development system featuring a wide selection of basic & advanced analytic procedures for business, data mining, science, & engineering applications.
State-of-the-art software is costly & should be kept updated. Users should have access to the latest training to get the maximum benefit from that investment.
Professional training by experts in statistics, methodology & practical applications, as well as in “tips & tricks”, can greatly enhance user productivity. Expert training guides users and prospective users through the wealth of functionality the program offers. Streamlining your workflow saves both time & money.
Our training services help improve analytical skills, whether you are a beginner or an advanced user. StatSoft educators represent the world’s finest analytic expertise, drawn from industry and academic institutions.
Our training covers up-to-date time- & cost-saving techniques, keeping you abreast of the latest world trends & bringing value to your organization.

Introduction to STATISTICA – 20 & 21 June 2012
This two-day course allows the delegate to take full advantage of the numerous statistical & graphical tools available in STATISTICA. Step-by-step examples are provided for entering & manipulating data (including importing from other file formats such as spreadsheets and databases), performing complete statistical analyses & interpreting tabular & graphical results, as well as creating & customizing many graph types. The user will learn how to optimize the usage of STATISTICA. The last segment of the course will be devoted to questions & answers.

VENUE: Petervale Centre, Cor Cambridge & Frans Hals Rds, Bryanston, Sandton, Johannesburg
DAY 1
8.30 for 9 a.m.
General Conventions: User-interface; Customisation options; Creating reports, docs; Exercises & FAQs
Data Management: Creating, modifying & saving data; Importing data; File structure manipulation; Exercises & FAQs
Statistics: Overview of elementary concepts; Descriptive Statistics; T-Tests; Correlations; Frequency Tables; Cross Tabulations
4 p.m.

DAY 2
8.30 for 9 a.m.
Statistics Continued: Exercises & FAQs
Graphs: Overview of Graph types; Creating Graphs; Customising Graphs; Brushing Techniques; Curve-fitting; Exercises & FAQs
Automation: Introduction to automating & customising; Exercises
Overview of Additional Add-on Modules & their Applications
Industry-specific Example Applications
Questions & Answer Session
4 p.m.

StatSoft Southern Africa offers both introductory and advanced training courses in Johannesburg and other major cities in South Africa. StatSoft training classes offer:
Practical hands-on experience with the program
An introduction to real-world example applications
Energetic, helpful, knowledgeable instructors
Comprehensive take-home course manual
Personal attention, small class size
Interactive, class-paced learning

For more information or to register for this training please phone Lorraine Edel at 011-234-6148 or 082-5678-330 or mail info@statsoft.co.za. Please contact lorraine@statsoft.co.za to register.
www.statsoft.co.za

The demo of Statistica Version 10 is now available. To download and try Statistica Version 10, Click Here.

Kind Regards
LORRAINE EDEL
Statsoft Southern Africa
Tel: 011 234-6148
Fax: 086 544-1172
Cell: 082 5678 330
Mail: lorraine@statsoft.co.za
Web: www.statsoft.co.za

Powerful solutions for:
* Business intelligence
* Data mining
* Quality control
* Web-based analytics
* Research

Posted in Uncategorized

Tags: analysis, certificate, course, software, statistica training, statistics

Getting started with Statistica, Tutorials, Popular Videos!

May 29

Posted by statsoftsa

No need to feel lost getting started with STATISTICA! We’ve got you covered with our popular videos on text mining, data mining, and all things analytic.

Video Tutorials

Introductory Overview
There are several playlists on the StatSoft YouTube Channel.

Welcome to STATISTICA, where every analysis you will ever need is at your fingertips. Used around the world in at least 30 countries, StatSoft’s STATISTICA line of software has gained unprecedented recognition by users and reviewers. In addition to both basic and advanced statistics, STATISTICA products offer specialized tools for analyzing neural networks, determining sample size, designing experiments, creating real-time quality control charts, reporting via the Web, and much more . . . the possibilities are endless.

Title	Description
Use the Analysis Toolbar	In this demonstration, see the benefits of convenient multi-tasking functionality in STATISTICA. Run multiple copies of STATISTICA at the same time, run multiple analyses of the same or different kinds, run analyses on the same or different data files, or do all three.
Save and Retrieve Projects	STATISTICA Projects provide the means to save your work and return to it later. A project is a “snapshot” of STATISTICA at the time it was saved: input data, results including graphs, spreadsheets, workbooks and reports, and data miner workspaces. This tutorial explains how projects are used.
Use Variable Bundles	With the Variable Bundles Manager, you can easily create bundles of variables in order to organize large sets of variables and to facilitate the repeated selection of the same set of variables. By creating bundles, you can quickly and easily locate a subset of data in a large data file.
Perform By Group Analysis	With STATISTICA, you can generate output for each unique level of a By variable or unique combination of multiple By variables at the individual results level. This makes it very easy to compare results of an analysis across different groups.
Select Subsets of Cases	In this demonstration, see the extremely flexible facilities for case selection provided in STATISTICA. You can specify cases in two different ways, either temporarily, only for the duration of a single analysis, or more permanently for all subsequent analyses using the current spreadsheet.
Data Filtering/Cleaning	Data Cleaning is an important first step in Data Mining and general analysis projects. This tutorial illustrates several of the data cleaning tools of STATISTICA.
Use Spreadsheet Formulas	Variables in STATISTICA Spreadsheets can be defined by formulas that support a wide selection of mathematical, logical, and statistical functions. Furthermore, STATISTICA provides the option to automatically or manually recalculate all spreadsheet functions as the data change. In this demonstration, see how STATISTICA’s “type ahead” feature recognizes functions and prompts for the necessary parameters.
Select Output	What is your preference for showing the results of your analyses? See how the various output options in STATISTICA let you work the way you want. View your results in individual windows, store output in a convenient workbook, or annotate results in a presentation-quality report. The Output Manager gives you complete control and remembers your preferences.
Microsoft Office Integration	STATISTICA is fully integrated with Microsoft Office products. This demonstration shows how to output STATISTICA results to Microsoft Word and open a Microsoft Excel document in STATISTICA.
Workbook Multi-item Display	STATISTICA multi-item display enables you to quickly view and edit all documents within a workbook. This video demonstrates how to view multi-item displays, print and save multi-item displays as PDF files, and customize STATISTICA documents within the multi-item display grid.
Reports In PDF Format	With STATISTICA, you can easily create reports in Acrobat (PDF) format for all STATISTICA document types. This powerful feature enables you to share documents with colleagues who have a PDF Reader such as Adobe Acrobat Reader. This video demonstrates how to save and print all STATISTICA document types as PDF files.
Categories of Graphs	In addition to the specialized statistical graphs that are available through all results dialog boxes in STATISTICA, there are two general types of graphs in STATISTICA. In this demonstration, these two graph types are explored: Graphs of Input Data, which visualize or summarize raw values from input data spreadsheets, and Graphs of Block Data, which visualize arbitrarily selected blocks of values, regardless of their source.
Auto-Update Graphs	The dynamic features to automatically update graphs facilitate the visual exploration of data. STATISTICA Graphs are dynamically linked to the data. Thus, when the data change, the graphs automatically update. This video demonstration explores how this functionality can be used for data correction and how to glean important patterns visually from the data, as well as how to create custom graph templates.
Create Random Sub-Samples	During exploratory analysis of very large data sets, it may be best to perform a variety of preliminary analyses using a subset of data. When all the data cases are equally important and a smaller but fully representative subset of the data is sufficient, it is beneficial to use STATISTICA’s options for creating new data files containing random subsets of data contained in the parent files. See how a random subset is created from a file containing 100,000 data cases.
Use Microscrolls	In this demonstration, see how microscrolls, a flexible interface with full mouse and keyboard support, aid interactive input of numerical values in STATISTICA. Microscrolls are available in every dialog with numerical input options, and greatly increase the speed and efficiency of the user interface.
ActiveX Controls	With STATISTICA, you can embed ActiveX controls into graphs. ActiveX controls provide the capability to create a custom user interface. This video demonstrates the use of a slider control and how it can be used to create a highly interactive graph.
Web-Browser	This demonstration shows how browser windows in STATISTICA are useful for viewing STATISTICA Enterprise Server reports, as well as viewing custom-made web interfaces that seamlessly interact with STATISTICA.
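
For readers who want to see the idea behind the Create Random Sub-Samples demonstration outside of STATISTICA, here is a minimal sketch in Python using only the standard library. It is not how STATISTICA implements sampling; the 100,000-case "file" is simulated, and the names `cases` and `draw_subsample` are hypothetical.

```python
import random


def draw_subsample(cases, n, seed=None):
    """Draw n cases without replacement; every case is equally likely."""
    rng = random.Random(seed)  # a fixed seed makes the draw reproducible
    return rng.sample(cases, n)


# Simulated parent file: 100,000 case numbers standing in for data rows.
cases = list(range(100_000))
subsample = draw_subsample(cases, 1_000, seed=42)
print(len(subsample), "cases drawn from", len(cases))
```

Because the draw is without replacement, no case appears twice in the subset, which keeps the smaller file representative of the parent file.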

Posted in Uncategorized

Tags: academic, analysis, commercial, how to, manuals, software, south africa, statistica, statistics, tutorials, videos

Predictive Modeling Solutions for Banking Industry

May 28

Posted by statsoftsa

To understand customer needs, preferences, and behaviors, financial institutions such as banks, mortgage lenders, credit card companies, and investment advisors are turning to the powerful data mining techniques in STATISTICA Data Miner. These techniques help companies in the financial sector to uncover hidden trends and explain the patterns that affect every aspect of their overall success.

Financial institutions have long collected detailed customer data, often in many disparate databases and in various formats. Only with the recent advances in database technology and data mining software have financial institutions acquired the necessary tools to manage their risks using all available information and to explore a wide range of scenarios. Now, business strategies in financial institutions are developed more intelligently than ever before.

Risk Management, Credit Scorecard

STATISTICA Scorecard aids with the development, evaluation and monitoring of scorecard models.

Fraud Detection

Banking fraud attempts have seen a drastic increase in recent years, making fraud detection more important than ever. Despite efforts on the part of financial institutions, hundreds of millions of dollars are lost to fraud every year.

STATISTICA Data Miner helps banks and financial institutions to anticipate and quickly detect fraud and take immediate action to minimize costs. Through the use of sophisticated data mining tools, millions of transactions can be searched to spot patterns and detect fraudulent transactions.

Identify causes of risk; create sophisticated and automated models of risk.
Segment and predict behavior of homogeneous (similar) groups of customers.
Uncover hidden correlations between different indicators.
Create models to price futures, options, and stocks.
Optimize portfolio performance.

Tools and Techniques

STATISTICA Data Miner will empower your organization to provide better services and enhance the profitability of all aspects of your customer relationships. Predict customer behaviour with STATISTICA Data Miner’s General Classifier and Regression tools to find rules for organizing customers into classes or groups. Find out who your most profitable, loyal customers are and who is more likely to default on loans or miss a payment. Apply state-of-the-art techniques to build and compare a wide variety of linear, non-linear, decision-tree based, or neural network models.

Recognize patterns, segments, and clusters with STATISTICA Data Miner’s Cluster Analysis options and Generalized EM (Expectation Maximization) and K-means Clustering module. For example, clustering methods may help build a customer segmentation model from large data sets. Use the various methods for mapping customers and/or characteristics of customers and customer interactions, such as multidimensional scaling, factor analysis, correspondence analysis, etc., to detect the general rules that apply to your exchanges with your customers.

STATISTICA Data Miner’s powerful Neural Networks Explorer offers tools including classification, hidden structure detection, and forecasting coupled with an Intelligent Wizard to make even the most complex problems and advanced analyses seem easier.

Uncover the most important variables from among thousands of potential measures with Data Miner’s Feature Selection and Variable Filtering module, or simplify the data variables and fields using the Principal Components Analysis or Partial Least Squares modules.

Advanced forecasting methods

STATISTICA Data Miner also features Linear and Nonlinear Multiple Regression with link functions, Neural Networks, ARIMA, Exponentially Weighted Moving Average, Fourier Analysis, and many others. Learn from the data available to you, provide better services, and gain competitive advantages when you apply the absolute state-of-the-art in data mining techniques such as generalized linear and additive models, MARSplines, boosted trees, etc.
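
To give a feel for one of the forecasting methods named above, here is a small sketch of an Exponentially Weighted Moving Average in plain Python. It uses the textbook recursion z_t = lam * x_t + (1 - lam) * z_(t-1); the function name `ewma`, the choice to start the recursion at the first observation, and the sample figures are all assumptions made for illustration, not STATISTICA output.

```python
def ewma(series, lam=0.2):
    """Exponentially weighted moving average of a numeric series.

    lam is the smoothing weight given to the newest observation (0 < lam <= 1).
    """
    if not 0 < lam <= 1:
        raise ValueError("lam must be in (0, 1]")
    smoothed = []
    z = series[0]  # assumption: start the recursion at the first observation
    for x in series:
        z = lam * x + (1 - lam) * z
        smoothed.append(z)
    return smoothed


# Made-up monthly counts of missed payments, for illustration only.
monthly_missed_payments = [12, 15, 11, 14, 30, 13, 12]
print(ewma(monthly_missed_payments, lam=0.3))
```

A larger `lam` makes the smoothed series react faster to a sudden jump (such as the fifth value above); a smaller `lam` gives a steadier line.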

Posted in Uncategorized

Tags: analysis, banking, modelling, predictive, software, solutions, south africa, statistica, statistics

STATISTICA Multivariate Exploratory Techniques

May 24

Posted by statsoftsa

STATISTICA Multivariate Exploratory Techniques offers a broad selection of exploratory techniques, from cluster analysis to advanced classification tree methods, with a wide array of interactive visualization tools for exploring relationships and patterns, plus complete built-in Visual Basic scripting.

Cluster Analysis Techniques
Factor Analysis
Principal Components & Classification Analysis
Canonical Correlation Analysis
Reliability/Item Analysis
Classification Trees
Correspondence Analysis
Multidimensional Scaling
Discriminant Analysis
General Discriminant Analysis Models (GDA)

Details

Cluster Analysis

This module includes a comprehensive implementation of clustering methods (k-means, hierarchical clustering, two-way joining). The program can process data from either raw data files or matrices of distance measures. The user can cluster cases, variables, or both based on a wide variety of distance measures (including Euclidean, squared Euclidean, City-block (Manhattan), Chebychev, Power distances, Percent disagreement, and 1-r) and amalgamation/linkage rules (including single, complete, weighted and unweighted group average or centroid, Ward’s method, and others). Matrices of distances can be saved for further analysis with other modules of the STATISTICA system. In k-means clustering, the user has full control over the initial cluster centers. Extremely large analysis designs can be processed; for example, hierarchical (tree) joining can analyze matrices with over 1,000 variables, or with over 1 million distances. In addition to the standard cluster analysis output, a comprehensive set of descriptive statistics and extended diagnostics (e.g., the complete amalgamation schedule with cohesion levels in hierarchical clustering, the ANOVA table in k-means clustering) is available. Cluster membership data can be appended to the current data file for further processing. Graphics options in the Cluster Analysis module include customizable tree diagrams, discrete contour-style two-way joining matrix plots, plots of amalgamation schedules, plots of means in k-means clustering, and many others.
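
The distance measures listed above are easy to compare side by side. The sketch below writes out the usual textbook definitions in plain Python; the function names are hypothetical, and the exact formulas STATISTICA uses (for example, how Power distances are parameterized) are assumed rather than taken from the product.

```python
import math


def squared_euclidean(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y))


def euclidean(x, y):
    return math.sqrt(squared_euclidean(x, y))


def city_block(x, y):
    """City-block (Manhattan) distance: sum of absolute differences."""
    return sum(abs(a - b) for a, b in zip(x, y))


def chebychev(x, y):
    """Chebychev distance: the largest single-coordinate difference."""
    return max(abs(a - b) for a, b in zip(x, y))


def power_distance(x, y, p=2, r=2):
    """Power distance, assumed here to be (sum |a - b|^p)^(1/r)."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1 / r)


def percent_disagreement(x, y):
    """Share of coordinates that differ; suited to categorical codes."""
    return sum(a != b for a, b in zip(x, y)) / len(x)


def one_minus_r(x, y):
    """1 minus the Pearson correlation between the two profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return 1 - sxy / math.sqrt(sxx * syy)


case_a, case_b = [1, 2, 3], [4, 6, 3]  # two made-up cases on three variables
for measure in (euclidean, squared_euclidean, city_block, chebychev,
                power_distance, percent_disagreement, one_minus_r):
    print(measure.__name__, measure(case_a, case_b))
```

Running several measures on the same pair of cases shows how strongly the choice of distance can change which cases look "close" before any linkage rule is applied.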

Factor Analysis

The Factor Analysis module contains a wide range of statistics and options, and provides a comprehensive implementation of factor (and hierarchical factor) analytic techniques with extended diagnostics and a wide variety of analytic and exploratory graphs. It will perform principal components, common, and hierarchical (oblique) factor analysis, and can handle extremely large analysis problems (e.g., with thousands of variables). Confirmatory factor analysis (as well as path analysis) can also be performed via the Structural Equation Modeling and Path Analysis (SEPATH) module found in STATISTICA Advanced Linear/Non-Linear Models.

Principal Components & Classification Analysis

STATISTICA also includes a designated program for principal components and classification analysis. The output includes eigenvalues (regular, cumulative, relative), factor loadings, factor scores (which can be appended to the input data file, reviewed graphically as icons, and interactively recoded), and a number of more technical statistics and diagnostics. Available rotations include Varimax, Equimax, Quartimax, Biquartimax (either normalized or raw), and Oblique rotations. The factorial space can be plotted and reviewed “slice by slice” in either 2D or 3D scatterplots with labeled variable-points; other integrated graphs include Scree plots, various scatterplots, bar and line graphs, and others. After a factor solution is determined, the user can recalculate (i.e., reconstruct) the correlation matrix from the respective number of factors to evaluate the fit of the factor model. Both raw data files and matrices of correlations can be used as input. Confirmatory factor analysis and other related analyses can be performed with the Structural Equation Modeling and Path Analysis (SEPATH) module available in STATISTICA Advanced Linear/Non-Linear Models, where a designated Confirmatory Factor Analysis Wizard will guide you step by step through the process of specifying the model.

Canonical Correlation Analysis

This module offers a comprehensive implementation of canonical analysis procedures; it can process raw data files or correlation matrices and it computes all of the standard canonical correlation statistics (including eigenvectors, eigenvalues, redundancy coefficients, canonical weights, loadings, extracted variances, significance tests for each root, etc.) and a number of extended diagnostics. The scores of canonical variates can be computed for each case, appended to the data file, and visualized via integrated icon plots. The Canonical Analysis module also includes a variety of integrated graphs (including plots of eigenvalues, canonical correlations, scatterplots of canonical variates, and many others). Note that confirmatory analyses of structural relationships between latent variables can also be performed via the SEPATH (Structural Equation Modeling and Path Analysis) module in STATISTICA Advanced Linear/Non-Linear Models. Advanced stepwise and best-subset selection of predictor variables for MANOVA/MANCOVA designs (with multiple dependent variables) is available in the General Regression Models (GRM) module in STATISTICA Advanced Linear/Non-Linear Models.

Reliability/Item Analysis

This module includes a comprehensive selection of procedures for the development and evaluation of surveys and questionnaires. As in all other modules of STATISTICA, extremely large designs can be analyzed. The user can calculate reliability statistics for all items in a scale, interactively select subsets, or obtain comparisons between subsets of items via the “split-half” (or split-part) method. In a single run, the user can evaluate the reliability of a sum-scale as well as subscales. When interactively deleting items, the new reliability is computed instantly without processing the data file again. The output includes correlation matrices and descriptive statistics for items, Cronbach alpha, the standardized alpha, the average inter-item correlation, the complete ANOVA table for the scale, the complete set of item-total statistics (including multiple item-total R’s), the split-half reliability, and the correlation between the two halves corrected for attenuation. A selection of graphs (including various integrated scatterplots, histograms, line plots and other plots) and a set of interactive what-if procedures are provided to aid in the development of scales. For example, the user can calculate the expected reliability after adding a particular number of items to the scale, and can estimate the number of items that would have to be added to the scale in order to achieve a particular reliability. Also, the user can estimate the correlation corrected for attenuation between the current scale and another measure (given the reliability of the current scale).
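
The what-if procedures described above rest on a few well-known formulas. The following sketch, in plain Python, shows Cronbach's alpha, the Spearman-Brown prediction for a lengthened scale, and the classical correction for attenuation. It illustrates the standard textbook formulas only; the function names and the tiny data set are hypothetical, and nothing here is taken from STATISTICA's own code.

```python
import math
from statistics import variance


def cronbach_alpha(items):
    """items: one list of scores per item, all for the same respondents."""
    k = len(items)
    totals = [sum(scores) for scores in zip(*items)]  # sum-scale per respondent
    item_variance_sum = sum(variance(scores) for scores in items)
    return k / (k - 1) * (1 - item_variance_sum / variance(totals))


def spearman_brown(reliability, length_factor):
    """Expected reliability if the scale is made length_factor times as long."""
    return length_factor * reliability / (1 + (length_factor - 1) * reliability)


def length_factor_needed(reliability, target):
    """How many times longer the scale must be to reach the target reliability."""
    return target * (1 - reliability) / (reliability * (1 - target))


def corrected_for_attenuation(r_xy, rel_x, rel_y=1.0):
    """Correlation between two measures after correcting for unreliability."""
    return r_xy / math.sqrt(rel_x * rel_y)


# Made-up answers of five respondents to three questionnaire items.
items = [[3, 4, 2, 5, 4], [2, 4, 3, 5, 3], [3, 5, 2, 4, 4]]
alpha = cronbach_alpha(items)
print("alpha:", alpha)
print("alpha if the scale were twice as long:", spearman_brown(alpha, 2))
print("length factor needed for 0.95:", length_factor_needed(alpha, 0.95))
```

Multiplying the length factor by the current number of items gives the total number of items the lengthened scale would need, under the usual assumption that the added items behave like the existing ones.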

Classification Trees

STATISTICA’s Classification Trees module provides a comprehensive implementation of the most recently developed algorithms for efficiently producing and testing the robustness of classification trees (a classification tree is a rule for predicting the class of an object from the values of its predictor variables). STATISTICA Data Miner offers additional advanced methods for tree classification such as Boosted Trees, Random Forests, General Classification and Regression Tree Models (GTrees) and General CHAID (Chi-square Automatic Interaction Detection) models. Classification trees can be produced using categorical predictor variables, ordered predictor variables, or both, and using univariate splits or linear combination splits.

Classification Trees Analysis options include performing exhaustive splits or discriminant-based splits; unbiased variable selection (as in QUEST); direct stopping rules (as in FACT) or bottom-up pruning (as in C&RT); pruning based on misclassification rates or on the deviance function; generalized Chi-square, G-square, or Gini-index goodness of fit measures. Priors and misclassification costs can be specified as equal, estimated from the data, or user-specified. The user can also specify the v value for v-fold cross-validation during tree building, v value for v-fold cross-validation for error estimation, size of the SE rule, minimum node size before pruning, seeds for random number generation, and alpha value for variable selection. Integrated graphics options are provided to explore the input and output data.
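
As a rough illustration of what an exhaustive search for a univariate split with a Gini-index measure involves, here is a short plain-Python sketch. It follows the general C&RT-style idea of trying every cut point on one ordered predictor; the function names and the loan data are made up, and the code is not a reproduction of STATISTICA's algorithm.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a node: 0 when all cases share one class."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def split_gini(left, right):
    """Size-weighted impurity of the two child nodes of a split."""
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)


def best_threshold(values, labels):
    """Try every midpoint between distinct sorted values; keep the purest split."""
    pairs = sorted(zip(values, labels), key=lambda pair: pair[0])
    best = (None, float("inf"))
    for i in range(1, len(pairs)):
        if pairs[i][0] == pairs[i - 1][0]:
            continue  # no cut point between identical values
        threshold = (pairs[i][0] + pairs[i - 1][0]) / 2
        left = [label for value, label in pairs if value <= threshold]
        right = [label for value, label in pairs if value > threshold]
        score = split_gini(left, right)
        if score < best[1]:
            best = (threshold, score)
    return best


# Made-up example: debt-to-income ratio and whether the loan defaulted.
ratio = [0.10, 0.15, 0.22, 0.41, 0.48, 0.55]
outcome = ["paid", "paid", "paid", "default", "default", "default"]
print(best_threshold(ratio, outcome))
```

A full tree repeats this search on every predictor at every node and then prunes the result; the sketch covers only the single-split step.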

Correspondence Analysis

This module features a full implementation of simple and multiple correspondence analysis techniques, and can analyze even extremely large tables. The program will accept input data files with grouping (coding) variables that are to be used to compute the crosstabulation table, data files that contain frequencies (or some other measure of correspondence, association, similarity, confusion, etc.) and coding variables that identify (enumerate) the cells in the input table, or data files with frequencies (or other measure of correspondence) only (e.g., the user can directly type in and analyze a frequency table). For multiple correspondence analysis, the user can also directly specify a Burt table as input for the analysis. The program will compute various tables, including the table of row percentages, column percentages, total percentages, expected values, observed minus expected values, standardized deviates, and contributions to the Chi-square values. The Correspondence Analysis module will compute the generalized eigenvalues and eigenvectors, and report all standard diagnostics including the singular values, eigenvalues, and proportions of inertia for each dimension. The user can either manually choose the number of dimensions, or specify a cutoff value for the maximum cumulative percent of inertia. The program will compute the standard coordinate values for column and row points. The user has the choice of row-profile standardization, column-profile standardization, row and column profile standardization, or canonical standardization. For each dimension and row or column point, the program will compute the inertia, quality, and cosine-square values. In addition, the user can display (in spreadsheets) the matrices of the generalized singular vectors; like the values in all spreadsheets, these matrices can be accessed via STATISTICA Visual Basic, for example, in order to implement non-standard methods of computing the coordinates. 
The user can compute coordinate values and related statistics (quality and cosine-square values) for supplementary points (row or column), and compare the results with the regular row and column points. Supplementary points can also be specified for multiple correspondence analysis. In addition to the 3D histograms that can be computed for all tables, the user can produce a line plot for the eigenvalues, and 1D, 2D, and 3D plots for the row or column points. Row and column points can also be combined in a single graph, along with any supplementary points (each type of point will use a different color and point marker, so the different types of points can easily be identified in the plots). All points are labeled, and an option is available to truncate the names for the points to a user-specified number of characters.
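
Several of the tables mentioned above (row percentages, expected values, standardized deviates, contributions to Chi-square) can be derived from a frequency table with a few lines of arithmetic. The sketch below does this in plain Python for a small made-up table; `ca_tables` is a hypothetical helper, and the total inertia is taken as Chi-square divided by the grand total, which is the usual definition in simple correspondence analysis.

```python
import math


def ca_tables(table):
    """Derive the basic correspondence-analysis tables from a frequency table."""
    n = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    expected = [[r * c / n for c in col_totals] for r in row_totals]
    deviates = [[(o - e) / math.sqrt(e) for o, e in zip(obs_row, exp_row)]
                for obs_row, exp_row in zip(table, expected)]
    contributions = [[d ** 2 for d in row] for row in deviates]
    chi_square = sum(sum(row) for row in contributions)
    return {
        "row_percentages": [[100 * o / r for o in row]
                            for row, r in zip(table, row_totals)],
        "expected": expected,
        "standardized_deviates": deviates,
        "chi_square_contributions": contributions,
        "chi_square": chi_square,
        "total_inertia": chi_square / n,
    }


# Made-up crosstabulation: two customer segments by two product types.
result = ca_tables([[20, 10], [10, 20]])
print(result["expected"])
print(result["chi_square"], result["total_inertia"])
```

The eigenvalues reported by a correspondence analysis sum to the total inertia, so this figure is a quick check on how much association there is to explain before any dimensions are extracted.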

Multidimensional Scaling

The Multidimensional Scaling module includes a full implementation of (nonmetric) multidimensional scaling. Matrices of similarities, dissimilarities, or correlations between variables (i.e., “objects” or cases) can be analyzed. The starting configuration can be computed by the program (via principal components analysis) or specified by the user. The program employs an iterative procedure to minimize the stress value and the coefficient of alienation. The user can monitor the iterations and inspect the changes in these values. The final configurations can be reviewed via spreadsheets, and via 2D and 3D scatterplots of the dimensional space with labeled item-points. The output includes the values for the raw stress (raw F), Kruskal stress coefficient S, and the coefficient of alienation. The goodness of fit can be evaluated via Shepard diagrams (with d-hats and d-stars). Like all other results in STATISTICA, the final configuration can be saved to a data file.
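
To make the stress values mentioned above a little more concrete, here is a minimal sketch of the commonly cited Kruskal stress-1 formula in plain Python. It compares the distances reproduced by a configuration with the fitted disparities (the "d-hats"); whether STATISTICA normalizes its reported coefficient in exactly this way is an assumption, and the function names and numbers are hypothetical.

```python
import math


def raw_stress(distances, disparities):
    """Sum of squared differences between reproduced distances and d-hats."""
    return sum((d - d_hat) ** 2 for d, d_hat in zip(distances, disparities))


def kruskal_stress(distances, disparities):
    """Stress-1: raw stress scaled by the sum of squared distances."""
    return math.sqrt(raw_stress(distances, disparities)
                     / sum(d ** 2 for d in distances))


# Made-up distances between three pairs of objects and their fitted d-hats.
reproduced = [1.0, 2.1, 2.9]
d_hats = [1.0, 2.0, 3.0]
print(kruskal_stress(reproduced, d_hats))
```

A value of zero means the configuration reproduces the rank order of the input dissimilarities perfectly; the iterative procedure tries to move the points so that this value shrinks.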

Discriminant Analysis

The Discriminant Analysis module is a full implementation of multiple stepwise discriminant function analysis. STATISTICA also includes the General Discriminant Analysis Models module (below) for fitting ANOVA/ANCOVA-like designs to categorical dependent variables, and to perform various advanced types of analyses (e.g., best subset selection of predictors, profiling of posterior probabilities, etc.). The Discriminant Analysis program will perform forward or backward stepwise analyses, or enter user-specified blocks of variables into the model.

In addition to the numerous graphics and diagnostics describing the discriminant functions, the program also provides a wide range of options and statistics for the classification of old or new cases (for validation of the model). The output includes the respective Wilks’ lambdas, partial lambdas, F to enter (or remove), the p levels, the tolerance values, and the R-square. The program will perform a full canonical analysis and report the raw and cumulative eigenvalues for all roots, and their p levels, the raw and standardized discriminant (canonical) function coefficients, the structure coefficient matrix (of factor loadings), the means for the discriminant functions, and the discriminant scores for each case (which can also be automatically appended to the data file). Integrated graphs include histograms of the canonical scores within each group (and all groups combined), special scatterplots for pairs of canonical variables (where group membership of individual cases is visibly marked), a comprehensive selection of categorized (multiple) graphs allowing the user to explore the distribution and relations between dependent variables across the groups (including multiple box-and-whisker plots, histograms, scatterplots, and probability plots), and many others. The Discriminant Analysis module will also compute the standard classification functions for each group. The classification of cases can be reviewed in terms of Mahalanobis distances, posterior probabilities, or actual classifications, and the scores for individual cases can be visualized via exploratory icon plots and other multidimensional graphs integrated directly with the results spreadsheets. All of these values can be automatically appended to the current data file for further analyses. The summary classification matrix of the number and percent of correctly classified cases can also be displayed. 
The user has several options to specify the a priori classification probabilities and can specify selection conditions to include or exclude selected cases from the classification (e.g., to validate the classification functions in a new sample).
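
The link between Mahalanobis distances, a priori probabilities, and posterior probabilities can be sketched in a few lines. Under the usual assumption of multivariate normal groups with a common covariance matrix, the posterior probability of each group is proportional to its prior times exp(-0.5 * squared distance). The Python below applies that rule; the function name, the group labels, and the distances are all invented for illustration.

```python
import math


def posterior_probabilities(squared_distances, priors):
    """Posterior group probabilities from squared Mahalanobis distances.

    Assumes normally distributed groups that share one covariance matrix.
    """
    weights = {group: priors[group] * math.exp(-0.5 * d2)
               for group, d2 in squared_distances.items()}
    total = sum(weights.values())
    return {group: weight / total for group, weight in weights.items()}


# Made-up squared distances of one new case to each group centroid.
squared_distances = {"good payer": 1.2, "late payer": 3.5, "defaulter": 6.0}
equal_priors = {group: 1 / 3 for group in squared_distances}
posteriors = posterior_probabilities(squared_distances, equal_priors)
print(posteriors)
print("classified as:", max(posteriors, key=posteriors.get))
```

Changing the priors (for example, to the group proportions observed in the data) shifts the posteriors without changing the distances, which is why the choice of a priori probabilities matters for classification.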

General Discriminant Analysis Models (GDA)

The STATISTICA General Discriminant Analysis (GDA) module is an application and extension of the General Linear Model to classification problems. Like the Discriminant Analysis module, GDA allows you to perform standard and stepwise discriminant analyses. GDA implements the discriminant analysis problem as a special case of the general linear model, and thereby offers extremely useful analytic techniques that are innovative, efficient, and extremely powerful.

Computational approach and unique applications. As in traditional discriminant analysis, GDA allows you to specify a categorical dependent variable. For the analysis, the group membership (with regard to the dependent variable) is then coded into indicator variables, and all methods of GRM can be applied. In the results dialogs, the extensive selection of residual statistics of GRM and GLM is available in GDA as well; for example, you can review all the regression-like residuals and predicted values for each group (each coded dependent indicator variable), and choose from the large number of residual plots. In addition, all specialized prediction and classification statistics are computed that are commonly reviewed in a discriminant analysis; but those statistics can be reviewed in innovative ways because of STATISTICA’s unique approach. For example, you can perform “desirability profiling” by combining the posterior prediction probabilities for the groups into a desirability score, and then let the program find the values or combination of categorical predictor settings that will optimize that score. Thus, GDA provides powerful and efficient tools for data mining as well as applied research; for example, you could use the DOE (Design of Experiments) methods to generate an experimental design for quality improvement, apply this design to categorical outcome data (e.g., distinct classifications of an outcome as “superior,” “acceptable,” or “failed”), and then model the posterior prediction probabilities of those outcomes using the variables of your experimental design.
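
The step of coding group membership into indicator variables can be pictured with a very small example. The Python sketch below builds one 0/1 column per outcome class; it shows plain full indicator coding only, the helper name `indicator_coding` is hypothetical, and the exact coding scheme STATISTICA applies internally is not specified in this overview.

```python
def indicator_coding(groups):
    """Return the sorted class levels and one 0/1 indicator row per case."""
    levels = sorted(set(groups))
    rows = [[1 if group == level else 0 for level in levels]
            for group in groups]
    return levels, rows


# Made-up outcome classifications from a quality-improvement experiment.
outcomes = ["superior", "failed", "acceptable", "failed"]
levels, indicators = indicator_coding(outcomes)
print(levels)
for outcome, row in zip(outcomes, indicators):
    print(outcome, row)
```

Each indicator column can then be treated like a dependent variable in a regression-style model, which is what makes the regression-like residuals and predicted values for each group available.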

Standard discriminant analysis results. STATISTICA GDA will compute all standard results for discriminant analysis, including discriminant function coefficients, canonical analysis results (standardized and raw coefficients, step-down tests of canonical roots, etc.), classification statistics (including Mahalanobis distances, posterior probabilities, actual classification of cases in the analysis sample and validation sample, misclassification matrix, etc.), and so on.

Unique features of GDA, currently only available in STATISTICA. In addition, STATISTICA GDA includes numerous unique features and results:

Specifying predictor variables and effects; model building:

1. Support for continuous and categorical predictors; instead of allowing only continuous predictors in the analysis (the common limitation in traditional discriminant function analysis programs), GDA allows the user to specify simple and complex ANOVA and ANCOVA-like designs, e.g., mixtures of continuous and categorical predictors, polynomial (response surface) designs, factorial designs, nested designs, etc.

2. Multiple-degree of freedom effects in stepwise selection; the terms that make up the predictor set (consisting not only of single-degree of freedom continuous predictors, but also multiple-degree of freedom effects) can be used in stepwise discriminant function analyses; multiple-degree of freedom effects will always be entered/removed as blocks.

3. Best subset selection of predictor effects; single- and multiple-degree of freedom effects can be specified for best-subset discriminant analysis; the program will select the effects (up to a user-specified number of effects) that produce the best discrimination between groups.

4. Selection of predictor effects based on misclassification rates; GDA allows the user to perform model building (selection of predictor effects) not only based on traditional criteria (e.g., p-to-enter/remove; Wilks’ lambda), but also based on misclassification rates; in other words the program will select those predictor effects that maximize the accuracy of classification, either for those cases from which the parameter estimates were computed, or for a cross-validation sample (to guard against over-fitting); these techniques elevate GDA to the level of a fast neural-network-like data mining tool for classification, that can be used as an alternative to other similar techniques (tree-classifiers, designated neural-network methods, etc.; GDA will tend to be faster than those techniques because it is still based on the more efficient General Linear Model).

Results statistics; profiling:

1. Detailed results and diagnostic statistics and plots; in addition to the standard results statistics, GDA provides a large amount of auxiliary information to help the user judge the adequacy of the chosen discriminant analysis model (descriptive statistics and graphs, Mahalanobis distances, Cook distances, and leverages for predictors, etc.).

2. Profiling of expected classification; GDA includes an adaptation of the general GLM (GRM) response profiler; these options allow the user to quickly determine the values (or levels) of the predictor variables that maximize the posterior classification probability for a single group, or for a set of groups in the analyses; in a sense, the user can quickly determine the typical profiles of values of the predictors (or levels of categorical predictors) that identify a group (or set of groups) in the analysis.

A note of caution for models with categorical predictors, and other advanced techniques. The General Discriminant Analysis module provides functionality that makes this technique a general tool for classification and data mining. However, most, if not all, textbook treatments of discriminant function analysis are limited to simple and stepwise analyses with single-degree of freedom continuous predictors. No “experience” (in the literature) exists regarding the robustness and effectiveness of these techniques when they are generalized in the manner provided in this very powerful module. The use of best-subset methods, in particular when used in conjunction with categorical predictors or when using the misclassification rates in a cross-validation sample for choosing the best subset of predictors, should be considered a heuristic search method, rather than a statistical analysis technique.

System Requirements

STATISTICA Multivariate Exploratory Techniques is compatible with Windows XP, Windows Vista, and Windows 7.

Minimum System Requirements

Operating System: Windows XP or above
RAM: 256 MB
Processor Speed: 500 MHz

Recommended System Requirements

Operating System: Windows XP or above
RAM: 1 GB
Processor Speed: 2.0 GHz

Native 64-bit versions and highly optimized multiprocessor versions are available.

Tags: analysis, applications, multivariate, software, south africa, statistica, statistics, techniques

Getting Started with Statistics Concepts

May 23

Posted by statsoftsa

In this introduction, we will briefly discuss those elementary statistical concepts that provide the necessary foundations for more specialized expertise in any area of statistical data analysis. The selected topics illustrate the basic assumptions of most statistical methods and/or have been demonstrated in research to be necessary components of our general understanding of the “quantitative nature” of reality (Nisbett, et al., 1987). We will focus mostly on the functional aspects of the concepts discussed and the presentation will be very short. Further information on each of the concepts can be found in statistical textbooks. Recommended introductory textbooks are: Kachigan (1986), and Runyon and Haber (1976); for a more advanced discussion of elementary theory and assumptions of statistics, see the classic books by Hays (1988), and Kendall and Stuart (1979).

What are Variables?

Variables are things that we measure, control, or manipulate in research. They differ in many respects, most notably in the role they are given in our research and in the type of measures that can be applied to them.

Correlational vs. Experimental Research

Most empirical research belongs clearly to one of these two general categories. In correlational research, we do not (or at least try not to) influence any variables but only measure them and look for relations (correlations) between some set of variables, such as blood pressure and cholesterol level. In experimental research, we manipulate some variables and then measure the effects of this manipulation on other variables. For example, a researcher might artificially increase blood pressure and then record cholesterol level. Data analysis in experimental research also comes down to calculating “correlations” between variables, specifically, those manipulated and those affected by the manipulation. However, experimental data may potentially provide qualitatively better information: only experimental data can conclusively demonstrate causal relations between variables. For example, if we found that whenever we change variable A then variable B changes, then we can conclude that “A influences B.” Data from correlational research can only be “interpreted” in causal terms based on some theories that we have, but correlational data cannot conclusively prove causality.

Dependent vs. Independent Variables

Independent variables are those that are manipulated whereas dependent variables are only measured or registered. This distinction appears terminologically confusing to many because, as some students say, “all variables depend on something.” However, once you get used to this distinction, it becomes indispensable. The terms dependent and independent variable apply mostly to experimental research where some variables are manipulated, and in this sense they are “independent” from the initial reaction patterns, features, intentions, etc. of the subjects. Some other variables are expected to be “dependent” on the manipulation or experimental conditions. That is to say, they depend on “what the subject will do” in response. Somewhat contrary to the nature of this distinction, these terms are also used in studies where we do not literally manipulate independent variables, but only assign subjects to “experimental groups” based on some pre-existing properties of the subjects. For example, if in an experiment, males are compared to females regarding their white cell count (WCC), Gender could be called the independent variable and WCC the dependent variable.

Measurement Scales

Variables differ in how well they can be measured, i.e., in how much measurable information their measurement scale can provide. There is obviously some measurement error involved in every measurement, which determines the amount of information that we can obtain. Another factor that determines the amount of information that can be provided by a variable is its type of measurement scale. Specifically, variables are classified as (a) nominal, (b) ordinal, (c) interval, or (d) ratio.

Nominal variables allow for only qualitative classification. That is, they can be measured only in terms of whether the individual items belong to some distinctively different categories, but we cannot quantify or even rank order those categories. For example, all we can say is that two individuals are different in terms of variable A (e.g., they are of different race), but we cannot say which one “has more” of the quality represented by the variable. Typical examples of nominal variables are gender, race, color, city, etc.
Ordinal variables allow us to rank order the items we measure in terms of which has less and which has more of the quality represented by the variable, but still they do not allow us to say “how much more.” A typical example of an ordinal variable is the socioeconomic status of families. For example, we know that upper-middle is higher than middle but we cannot say that it is, for example, 18% higher. Also, this very distinction between nominal, ordinal, and interval scales itself represents a good example of an ordinal variable. For example, we can say that nominal measurement provides less information than ordinal measurement, but we cannot say “how much less” or how this difference compares to the difference between ordinal and interval scales.
Interval variables allow us not only to rank order the items that are measured, but also to quantify and compare the sizes of differences between them. For example, temperature, as measured in degrees Fahrenheit or Celsius, constitutes an interval scale. We can say that a temperature of 40 degrees is higher than a temperature of 30 degrees, and that an increase from 20 to 40 degrees is twice as much as an increase from 30 to 40 degrees.
Ratio variables are very similar to interval variables; in addition to all the properties of interval variables, they feature an identifiable absolute zero point, thus, they allow for statements such as x is two times more than y. Typical examples of ratio scales are measures of time or space. For example, as the Kelvin temperature scale is a ratio scale, not only can we say that a temperature of 200 degrees is higher than one of 100 degrees, we can correctly state that it is twice as high. Interval scales do not have the ratio property. Most statistical data analysis procedures do not distinguish between the interval and ratio properties of the measurement scales.
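The difference between interval and ratio scales can be checked numerically. The following is a minimal sketch in Python; the three temperatures and the helper names (`celsius_to_fahrenheit`, `celsius_to_kelvin`) are illustrative assumptions, not part of the original text. It shows that a ratio of differences survives a change of interval scale, whereas a ratio of values only has a stable meaning on a scale with an absolute zero.

```python
# Illustrative temperatures in degrees Celsius (assumed values).
low_c, mid_c, high_c = 20.0, 30.0, 40.0


def celsius_to_fahrenheit(c):
    return c * 9.0 / 5.0 + 32.0


def celsius_to_kelvin(c):
    return c + 273.15


# Ratio of two differences: the same on both interval scales.
diff_ratio_c = (high_c - low_c) / (high_c - mid_c)
diff_ratio_f = (
    celsius_to_fahrenheit(high_c) - celsius_to_fahrenheit(low_c)
) / (celsius_to_fahrenheit(high_c) - celsius_to_fahrenheit(mid_c))

# Ratio of two values: depends on where the scale puts its zero,
# so "40 degrees is twice 20 degrees" is not a meaningful statement
# in Celsius or Fahrenheit. Only Kelvin has an absolute zero.
value_ratio_c = high_c / low_c
value_ratio_f = celsius_to_fahrenheit(high_c) / celsius_to_fahrenheit(low_c)
value_ratio_k = celsius_to_kelvin(high_c) / celsius_to_kelvin(low_c)

print(diff_ratio_c, diff_ratio_f)
print(value_ratio_c, value_ratio_f, value_ratio_k)
```

The two difference ratios agree, while the three value ratios all differ; only the Kelvin figure describes how much "more temperature" there actually is.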

Relations between Variables

Regardless of their type, two or more variables are related if, in a sample of observations, the values of those variables are distributed in a consistent manner. In other words, variables are related if their values systematically correspond to each other for these observations. For example, Gender and WCC would be considered to be related if most males had high WCC and most females low WCC, or vice versa; Height is related to Weight because, typically, tall individuals are heavier than short ones; IQ is related to the Number of Errors in a test if people with higher IQs make fewer errors.

Why Relations between Variables are Important

Generally speaking, the ultimate goal of every research or scientific analysis is to find relations between variables. The philosophy of science teaches us that there is no other way of representing “meaning” except in terms of relations between some quantities or qualities; either way involves relations between variables. Thus, the advancement of science must always involve finding new relations between variables. Correlational research involves measuring such relations in the most straightforward manner. However, experimental research is not any different in this respect. For example, the above-mentioned experiment comparing WCC in males and females can be described as looking for a correlation between two variables: Gender and WCC. Statistics does nothing else but help us evaluate relations between variables. Actually, all of the hundreds of procedures that are described in this online textbook can be interpreted in terms of evaluating various kinds of inter-variable relations.

Two Basic Features of Every Relation between Variables

The two most elementary formal properties of every relation between variables are the relation’s (a) magnitude (or “size”) and (b) its reliability (or “truthfulness”).

Magnitude (or “size”). The magnitude is much easier to understand and measure than the reliability. For example, if every male in our sample was found to have a higher WCC than any female in the sample, we could say that the magnitude of the relation between the two variables (Gender and WCC) is very high in our sample. In other words, we could predict one based on the other (at least among the members of our sample).
Reliability (or “truthfulness”). The reliability of a relation is a much less intuitive concept, but still extremely important. It pertains to the “representativeness” of the result found in our specific sample for the entire population. In other words, it says how probable it is that a similar relation would be found if the experiment was replicated with other samples drawn from the same population. Remember that we are almost never “ultimately” interested only in what is going on in our sample; we are interested in the sample only to the extent it can provide information about the population. If our study meets some specific criteria (to be mentioned later), then the reliability of a relation between variables observed in our sample can be quantitatively estimated and represented using a standard measure (technically called p-value or statistical significance level, see the next paragraph).

What is “Statistical Significance” (p-value)?

The statistical significance of a result is the probability that a relationship (e.g., between variables) or a difference (e.g., between means) at least as strong as the one observed in a sample would occur by pure chance (“luck of the draw”) if, in the population from which the sample was drawn, no such relationship or difference existed. Using less technical terms, we could say that the statistical significance of a result tells us something about the degree to which the result is “true” (in the sense of being “representative of the population”).

More technically, the p-value represents a decreasing index of the reliability of a result (see Brownlee, 1960). The higher the p-value, the less we can believe that the observed relation between variables in the sample is a reliable indicator of the relation between the respective variables in the population. Specifically, the p-value represents the probability of error that is involved in accepting our observed result as valid, that is, as “representative of the population.” For example, a p-value of .05 (i.e., 1/20) indicates that there is a 5% probability that the relation between the variables found in our sample is a “fluke.” In other words, assuming that in the population there was no relation between those variables whatsoever, and we were repeating experiments such as ours one after another, we could expect that approximately in every 20 replications of the experiment there would be one in which the relation between the variables in question would be equal to or stronger than in ours. (Note that this is not the same as saying that, given that there IS a relationship between the variables, we can expect to replicate the results 5% of the time or 95% of the time; when there is a relationship between the variables in the population, the probability of replicating the study and finding that relationship is related to the statistical power of the design. See also Power Analysis.) In many areas of research, the p-value of .05 is customarily treated as a “border-line acceptable” error level.

How to Determine that a Result is “Really” Significant

There is no way to avoid arbitrariness in the final decision as to what level of significance will be treated as really “significant.” That is, the selection of some level of significance, up to which the results will be rejected as invalid, is arbitrary. In practice, the final decision usually depends on whether the outcome was predicted a priori or only found post hoc in the course of many analyses and comparisons performed on the data set, on the total amount of consistent supportive evidence in the entire data set, and on “traditions” existing in the particular area of research. Typically, in many sciences, results that yield p ≤ .05 are considered borderline statistically significant, but remember that this level of significance still involves a pretty high probability of error (5%). Results that are significant at the p ≤ .01 level are commonly considered statistically significant, and p ≤ .005 or p ≤ .001 levels are often called “highly” significant. But remember that these classifications represent nothing else but arbitrary conventions that are only informally based on general research experience.

Statistical Significance and the Number of Analyses Performed

Needless to say, the more analyses you perform on a data set, the more results will meet “by chance” the conventional significance level. For example, if you calculate correlations between ten variables (i.e., 45 different correlation coefficients), then you should expect to find by chance that about two (i.e., one in every 20) correlation coefficients are significant at the p ≤ .05 level, even if the values of the variables were totally random and those variables do not correlate in the population. Some statistical methods that involve many comparisons and, thus, a good chance for such errors include some “correction” or adjustment for the total number of comparisons. However, many statistical methods (especially simple exploratory data analyses) do not offer any straightforward remedies to this problem. Therefore, it is up to the researcher to carefully evaluate the reliability of unexpected findings. Many examples in this online textbook offer specific advice on how to do this; relevant information can also be found in most research methods textbooks.
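The arithmetic behind the ten-variable example can be written out as a short Python sketch. Two assumptions go beyond the text: the chance of at least one spurious result treats the 45 tests as independent (correlations computed from the same ten variables are not strictly independent, so this is only a rough guide), and the Bonferroni adjustment is shown as one common example of a “correction,” not as the method any particular procedure uses.

```python
from math import comb

n_variables = 10
alpha = 0.05

# Number of distinct pairs among ten variables.
n_correlations = comb(n_variables, 2)

# Expected number of coefficients that reach p <= .05 purely by chance.
expected_false_positives = n_correlations * alpha

# Rough chance of at least one spurious "significant" coefficient,
# ASSUMING the tests are independent.
p_at_least_one = 1 - (1 - alpha) ** n_correlations

# Bonferroni-style per-test threshold (one possible correction).
bonferroni_alpha = alpha / n_correlations

print(n_correlations, expected_false_positives)
print(round(p_at_least_one, 2), bonferroni_alpha)
```

With 45 coefficients the expected count is 2.25, which matches the “about two” in the text.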

Strength vs. Reliability of a Relation between Variables

We said before that strength and reliability are two different features of relationships between variables. However, they are not totally independent. In general, in a sample of a particular size, the larger the magnitude of the relation between variables, the more reliable the relation (see the next paragraph).

Why Stronger Relations between Variables are More Significant

Assuming that there is no relation between the respective variables in the population, the most likely outcome would be also finding no relation between these variables in the research sample. Thus, the stronger the relation found in the sample, the less likely it is that there is no corresponding relation in the population. As you see, the magnitude and significance of a relation appear to be closely related, and we could calculate the significance from the magnitude and vice-versa; however, this is true only if the sample size is kept constant, because the relation of a given strength could be either highly significant or not significant at all, depending on the sample size (see the next paragraph).

Why Significance of a Relation between Variables Depends on the Size of the Sample

If there are very few observations, then there are also respectively few possible combinations of the values of the variables and, thus, the probability of obtaining by chance a combination of those values indicative of a strong relation is relatively high.

Consider the following illustration. If we are interested in two variables (Gender: male/female and WCC: high/low), and there are only four subjects in our sample (two males and two females), then the probability that we will find, purely by chance, a 100% relation between the two variables can be as high as one-eighth. Specifically, there is a one-in-eight chance that both males will have a high WCC and both females a low WCC, or vice versa.

Now consider the probability of obtaining such a perfect match by chance if our sample consisted of 100 subjects; the probability of obtaining such an outcome by chance would be practically zero.
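Both figures can be verified directly. The sketch below is a minimal Python illustration that assumes each subject is independently “high” or “low” with probability 1/2; the function names are hypothetical. It enumerates every possible outcome for the four-subject case and uses the equivalent closed form for larger samples, where enumeration would be impractical.

```python
from fractions import Fraction
from itertools import product


def prob_perfect_relation(n_males, n_females):
    """Exact probability, by enumeration, that all males fall in one WCC
    category and all females in the other (assumes each subject is
    independently 'high' or 'low' with probability 1/2)."""
    n = n_males + n_females
    hits = 0
    for outcome in product(("high", "low"), repeat=n):
        males, females = outcome[:n_males], outcome[n_males:]
        if (
            len(set(males)) == 1
            and len(set(females)) == 1
            and males[0] != females[0]
        ):
            hits += 1
    return Fraction(hits, 2 ** n)


def prob_perfect_relation_closed_form(n_subjects):
    # Exactly two perfect patterns exist among 2**n equally likely outcomes.
    return Fraction(2, 2 ** n_subjects)


print(prob_perfect_relation(2, 2))                     # four subjects
print(float(prob_perfect_relation_closed_form(100)))   # one hundred subjects
```

For four subjects the result is 1/8; for 100 subjects it is on the order of 10 to the power of -30, which is the “practically zero” of the text.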

Let’s look at a more general example. Imagine a theoretical population in which the average value of WCC in males and females is exactly the same. Needless to say, if we start replicating a simple experiment by drawing pairs of samples (of males and females) of a particular size from this population and calculating the difference between the average WCC in each pair of samples, most of the experiments will yield results close to 0. However, from time to time, a pair of samples will be drawn where the difference between males and females will be quite different from 0. How often will it happen? The smaller the sample size in each experiment, the more likely it is that we will obtain such erroneous results, which in this case would be results indicative of the existence of a relation between Gender and WCC obtained from a population in which such a relation does not exist.
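This thought experiment is easy to simulate. The following Python sketch assumes, purely for illustration, that WCC is normally distributed with mean 100 and standard deviation 10 in both sexes, and that a difference larger than 5 counts as “quite different from 0”; none of these numbers come from the text.

```python
import random
import statistics


def simulate_mean_differences(n_per_group, n_experiments=2000, seed=0):
    """Repeatedly draw a male and a female sample from the SAME population
    and return the difference between the two sample means."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_experiments):
        males = [rng.gauss(100, 10) for _ in range(n_per_group)]
        females = [rng.gauss(100, 10) for _ in range(n_per_group)]
        diffs.append(statistics.fmean(males) - statistics.fmean(females))
    return diffs


def share_of_large_differences(diffs, threshold=5.0):
    """Fraction of experiments whose absolute difference exceeds threshold."""
    return sum(abs(d) > threshold for d in diffs) / len(diffs)


small_samples = simulate_mean_differences(n_per_group=5)
large_samples = simulate_mean_differences(n_per_group=100)

print(share_of_large_differences(small_samples))
print(share_of_large_differences(large_samples))
```

Under these assumptions, a sizeable share of the five-subject experiments show a misleading difference, while almost none of the 100-subject experiments do.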

Example: Baby Boys to Baby Girls Ratio

Consider this example from research on statistical reasoning (Nisbett, et al., 1987). There are two hospitals: in the first one, 120 babies are born every day; in the other, only 12. On average, the ratio of baby boys to baby girls born every day in each hospital is 50/50. However, one day, in one of those hospitals, twice as many baby girls were born as baby boys. In which hospital was it more likely to happen? The answer is obvious for a statistician, but as research shows, not so obvious for a lay person: it is much more likely to happen in the small hospital. The reason for this is that technically speaking, the probability of a random deviation of a particular size (from the population mean), decreases with the increase in the sample size.
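The hospital example can be worked out exactly with the binomial distribution. The sketch below is a minimal Python illustration; it assumes each birth is independently a girl with probability 1/2 and reads “twice as many girls as boys” as “at least twice as many” (8 or more girls out of 12, 80 or more out of 120). The function name is hypothetical.

```python
from math import comb


def prob_at_least(n_births, min_girls):
    """P(at least min_girls girls among n_births), each birth being a girl
    with probability 1/2 independently (binomial model, an assumption)."""
    favourable = sum(comb(n_births, k) for k in range(min_girls, n_births + 1))
    return favourable / 2 ** n_births


p_small_hospital = prob_at_least(12, 8)    # 8 girls vs. 4 boys, or more extreme
p_large_hospital = prob_at_least(120, 80)  # 80 girls vs. 40 boys, or more extreme

print(p_small_hospital, p_large_hospital)
```

Under these assumptions such a day occurs in the small hospital roughly one day in five, and in the large hospital far less than one day in a thousand.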

Why Small Relations Can be Proven Significant Only in Large Samples

The examples in the previous paragraphs indicate that if a relationship between variables in question is “objectively” (i.e., in the population) small, then there is no way to identify such a relation in a study unless the research sample is correspondingly large. Even if our sample is in fact “perfectly representative,” the effect will not be statistically significant if the sample is small. Analogously, if a relation in question is “objectively” very large, then it can be found to be highly significant even in a study based on a very small sample.

Consider this additional illustration. If a coin is slightly asymmetrical and, when tossed, is somewhat more likely to produce heads than tails (e.g., 60% vs. 40%), then ten tosses would not be sufficient to convince anyone that the coin is asymmetrical even if the outcome obtained (six heads and four tails) was perfectly representative of the bias of the coin. However, is it so that 10 tosses is not enough to prove anything? No; if the effect in question were large enough, then ten tosses could be quite enough. For instance, imagine now that the coin is so asymmetrical that no matter how you toss it, the outcome will be heads. If you tossed such a coin ten times and each toss produced heads, most people would consider it sufficient evidence that something is wrong with the coin. In other words, it would be considered convincing evidence that in the theoretical population of an infinite number of tosses of this coin, there would be more heads than tails. Thus, if a relation is large, then it can be found to be significant even in a small sample.
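The coin illustration can be made concrete by asking how often a perfectly fair coin would produce each outcome. The Python sketch below computes one-sided binomial probabilities; the function name is hypothetical, and choosing a one-sided “at least this many heads” probability is an assumption made for simplicity.

```python
from math import comb


def prob_at_least_heads(n_tosses, min_heads, p_heads=0.5):
    """P(at least min_heads heads in n_tosses) for a coin that lands heads
    with probability p_heads (fair by default)."""
    return sum(
        comb(n_tosses, k) * p_heads ** k * (1 - p_heads) ** (n_tosses - k)
        for k in range(min_heads, n_tosses + 1)
    )


p_six_of_ten = prob_at_least_heads(10, 6)   # six or more heads from a fair coin
p_ten_of_ten = prob_at_least_heads(10, 10)  # ten heads from a fair coin

print(p_six_of_ten, p_ten_of_ten)
```

A fair coin gives six or more heads in ten tosses more than a third of the time, so that outcome proves nothing, whereas ten heads in a row happens only about once in a thousand attempts.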

Can “No Relation” be a Significant Result?

The smaller the relation between variables, the larger the sample size that is necessary to prove it significant. For example, imagine how many tosses would be necessary to prove that a coin is asymmetrical if its bias were only .000001%! Thus, the necessary minimum sample size increases as the magnitude of the effect to be demonstrated decreases. When the magnitude of the effect approaches 0, the necessary sample size to conclusively prove it approaches infinity. That is to say, if there is almost no relation between two variables, then the sample size must be almost equal to the population size, which is assumed to be infinitely large. Statistical significance represents the probability that a similar outcome would be obtained if we tested the entire population. Thus, everything that would be found after testing the entire population would be, by definition, significant at the highest possible level, and this also includes all “no relation” results.

How to Measure the Magnitude (Strength) of Relations between Variables

There are very many measures of the magnitude of relationships between variables that have been developed by statisticians; the choice of a specific measure in given circumstances depends on the number of variables involved, measurement scales used, nature of the relations, etc. Almost all of them, however, follow one general principle: they attempt to somehow evaluate the observed relation by comparing it to the “maximum imaginable relation” between those specific variables.

Technically speaking, a common way to perform such evaluations is to look at how differentiated the values of the variables are, and then calculate what part of this “overall available differentiation” is accounted for by instances when that differentiation is “common” in the two (or more) variables in question. Speaking less technically, we compare “what is common in those variables” to “what potentially could have been common if the variables were perfectly related.”

Let’s consider a simple illustration. Let’s say that in our sample, the average index of WCC is 100 in males and 102 in females. Thus, we could say that on average, the deviation of each individual score from the grand mean (101) contains a component due to the gender of the subject; the size of this component is 1. That value, in a sense, represents some measure of relation between Gender and WCC. However, this value is a very poor measure because it does not tell us how relatively large this component is given the “overall differentiation” of WCC scores. Consider two extreme possibilities:

If all WCC scores of males were equal exactly to 100 and those of females equal to 102, then all deviations from the grand mean in our sample would be entirely accounted for by gender. We would say that in our sample, Gender is perfectly correlated with WCC, that is, 100% of the observed differences between subjects regarding their WCC is accounted for by their gender.
If WCC scores were in the range of 0-1000, the same difference (of 2) between the average WCC of males and females found in the study would account for such a small part of the overall differentiation of scores that most likely it would be considered negligible. For example, one more subject taken into account could change, or even reverse the direction of the difference. Therefore, every good measure of relations between variables must take into account the overall differentiation of individual scores in the sample and evaluate the relation in terms of (relatively) how much of this differentiation is accounted for by the relation in question.

Common “General Format” of Most Statistical Tests

Because the ultimate goal of most statistical tests is to evaluate relations between variables, most statistical tests follow the general format that was explained in the previous paragraph. Technically speaking, they represent a ratio of some measure of the differentiation common in the variables in question to the overall differentiation of those variables. For example, they represent a ratio of the part of the overall differentiation of the WCC scores that can be accounted for by gender to the overall differentiation of the WCC scores. This ratio is usually called a ratio of explained variation to total variation. In statistics, the term explained variation does not necessarily imply that we “conceptually understand” it. It is used only to denote the common variation in the variables in question, that is, the part of variation in one variable that is “explained” by the specific values of the other variable, and vice versa.
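The ratio of explained to total variation can be computed for the two extreme Gender and WCC cases described earlier. The following is a minimal Python sketch: the scores are invented for illustration (only the group means of 100 and 102 come from the text), and the ratio is computed as the between-group sum of squares divided by the total sum of squares, a quantity often called eta squared.

```python
def explained_variation_ratio(groups):
    """Ratio of between-group (explained) to total sum of squares.
    `groups` maps a group label to a list of scores."""
    all_scores = [x for scores in groups.values() for x in scores]
    grand_mean = sum(all_scores) / len(all_scores)
    total_ss = sum((x - grand_mean) ** 2 for x in all_scores)
    explained_ss = sum(
        len(scores) * (sum(scores) / len(scores) - grand_mean) ** 2
        for scores in groups.values()
    )
    return explained_ss / total_ss


# Hypothetical scores: both data sets have group means of 100 and 102.
no_spread = {"male": [100, 100, 100, 100], "female": [102, 102, 102, 102]}
wide_spread = {"male": [0, 50, 150, 200], "female": [2, 52, 152, 202]}

ratio_no_spread = explained_variation_ratio(no_spread)
ratio_wide_spread = explained_variation_ratio(wide_spread)

print(ratio_no_spread, ratio_wide_spread)
```

The same two-point difference between the group means explains all of the variation in the first data set and a negligible fraction of it in the second.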

How the “Level of Statistical Significance” is Calculated

Let’s assume that we have already calculated a measure of a relation between two variables (as explained above). The next question is “how significant is this relation?” For example, is 40% of the explained variance between the two variables enough to consider the relation significant? The answer is “it depends.”

Specifically, the significance depends mostly on the sample size. As explained before, in very large samples, even very small relations between variables will be significant, whereas in very small samples even very large relations cannot be considered reliable (significant). Thus, in order to determine the level of statistical significance, we need a function that represents the relationship between “magnitude” and “significance” of relations between two variables, depending on the sample size. The function we need would tell us exactly “how likely it is to obtain a relation of a given magnitude (or larger) from a sample of a given size, assuming that there is no such relation between those variables in the population.” In other words, that function would give us the significance (p) level, and it would tell us the probability of error involved in rejecting the idea that the relation in question does not exist in the population. This “alternative” hypothesis (that there is no relation in the population) is usually called the null hypothesis. It would be ideal if the probability function was linear and, for example, only had different slopes for different sample sizes. Unfortunately, the function is more complex and is not always exactly the same; however, in most cases we know its shape and can use it to determine the significance levels for our findings in samples of a particular size. Most of these functions are related to a general type of function, which is called normal.
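To see how the same magnitude can be significant or not depending on the sample size, consider a correlation whose square (the explained variance) is 40%. The Python sketch below uses the Fisher z-transformation, a large-sample normal approximation chosen here only because it needs nothing beyond the standard library; it is not claimed to be the function any particular program uses, and it is crude for very small samples. The function name and the two sample sizes are illustrative assumptions.

```python
from math import atanh, erfc, sqrt


def approx_p_value_for_correlation(r, n):
    """Approximate two-sided p-value for a sample correlation r based on n
    observations, under the null hypothesis of zero correlation in the
    population (Fisher z-transformation, normal approximation)."""
    if n <= 3:
        raise ValueError("n must be greater than 3 for this approximation")
    z = atanh(r) * sqrt(n - 3)
    return erfc(abs(z) / sqrt(2))


r = sqrt(0.40)  # correlation corresponding to 40% explained variance

p_n5 = approx_p_value_for_correlation(r, 5)
p_n50 = approx_p_value_for_correlation(r, 50)

print(p_n5, p_n50)
```

With 5 observations the relation is far from conventional significance; with 50 observations the same magnitude is significant well beyond the .001 level.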

Why the “Normal Distribution” is Important

The “normal distribution” is important because in most cases, it well approximates the function that was introduced in the previous paragraph (for a detailed illustration, see Are All Test Statistics Normally Distributed?). The distribution of many test statistics is normal or follows some form that can be derived from the normal distribution. In this sense, philosophically speaking, the normal distribution represents one of the empirically verified elementary “truths about the general nature of reality,” and its status can be compared to the one of fundamental laws of natural sciences. The exact shape of the normal distribution (the characteristic “bell curve”) is defined by a function that has only two parameters: mean and standard deviation.

A characteristic property of the normal distribution is that approximately 68% of all of its observations fall within a range of ±1 standard deviation from the mean, and a range of ±2 standard deviations includes approximately 95% of the scores. In other words, in a normal distribution, observations that have a standardized value of less than -2 or more than +2 have a relative frequency of 5% or less. (Standardized value means that a value is expressed in terms of its difference from the mean, divided by the standard deviation.) If you have access to STATISTICA, you can explore the exact values of probability associated with different values in the normal distribution using the interactive Probability Calculator tool; for example, if you enter the Z value (i.e., standardized value) of 4, the associated probability computed by STATISTICA will be less than .0001, because in the normal distribution almost all observations (i.e., more than 99.99%) fall within the range of ±4 standard deviations.
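For readers without access to STATISTICA, the same figures can be checked with a few lines of Python. This is only a sketch using the standard library's error function; the helper names `normal_cdf`, `share_within`, and `two_tailed_area` are made up for this example and are not part of any STATISTICA interface.

```python
import math


def normal_cdf(z):
    """Cumulative probability of the standard normal distribution at z."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def share_within(k):
    """Share of observations within +/- k standard deviations of the mean."""
    return normal_cdf(k) - normal_cdf(-k)


def two_tailed_area(z):
    """Share of observations farther than |z| standard deviations from the mean."""
    return math.erfc(abs(z) / math.sqrt(2.0))


if __name__ == "__main__":
    print(round(share_within(1), 4))  # about 0.68
    print(round(share_within(2), 4))  # about 0.95
    print(two_tailed_area(4))         # less than .0001
```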

Illustration of How the Normal Distribution is Used in Statistical Reasoning (Induction)

Recall the example discussed above, where pairs of samples of males and females were drawn from a population in which the average value of WCC in males and females was exactly the same. Although the most likely outcome of such experiments (one pair of samples per experiment) was that the difference between the average WCC in males and females in each pair is close to zero, from time to time, a pair of samples will be drawn where the difference between males and females is quite different from 0. How often does it happen? If the sample size is large enough, the results of such replications are “normally distributed” (this important principle is explained and illustrated in the next paragraph) and, thus, knowing the shape of the normal curve, we can precisely calculate the probability of obtaining “by chance” outcomes representing various levels of deviation from the hypothetical population mean of 0. If such a calculated probability is so low that it meets the previously accepted criterion of statistical significance, then we have only one choice: conclude that our result gives a better approximation of what is going on in the population than the “null hypothesis” (remember that the null hypothesis was considered only for “technical reasons” as a benchmark against which our empirical result was evaluated). Note that this entire reasoning is based on the assumption that the shape of the distribution of those “replications” (technically, the “sampling distribution”) is normal. This assumption is discussed in the next paragraph.
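The calculation described in this paragraph can be sketched as follows. The sketch assumes that the sampling distribution of the difference between the two means is normal and that the standard deviation of the variable in the population is known and equal in both groups; the numbers in the usage example are invented for illustration and are not taken from the WCC example.

```python
import math


def chance_probability_of_difference(observed_diff, population_sd, n_per_group):
    """Two-sided probability of a difference at least this large under the null.

    Assumptions (illustrative only): both groups have n_per_group cases, the
    population standard deviation is known, and the sampling distribution of
    the difference between means is normal with mean 0.
    """
    # Standard error of the difference between two independent sample means.
    standard_error = population_sd * math.sqrt(2.0 / n_per_group)
    z = observed_diff / standard_error
    return math.erfc(abs(z) / math.sqrt(2.0))


if __name__ == "__main__":
    # Hypothetical values: difference of 1.0, SD of 2.0, 50 cases per group.
    print(chance_probability_of_difference(1.0, 2.0, 50))
```

If the resulting probability falls below the previously accepted criterion (for example, .05), the null hypothesis of no difference is rejected.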

Are All Test Statistics Normally Distributed?

Not all, but most of them are either based on the normal distribution directly or on distributions that are related to and can be derived from normal, such as t, F, or Chi-square. Typically, these tests require that the variables analyzed are themselves normally distributed in the population, that is, they meet the so-called “normality assumption.” Many observed variables actually are normally distributed, which is another reason why the normal distribution represents a “general feature” of empirical reality. The problem may occur when we try to use a normal distribution-based test to analyze data from variables that are themselves not normally distributed (see tests of normality in Nonparametrics or ANOVA/MANOVA). In such cases, we have two general choices. First, we can use some alternative “nonparametric” test (a so-called “distribution-free test”; see Nonparametrics), but this is often inconvenient because such tests are typically less powerful and less flexible in terms of the types of conclusions that they can provide. Alternatively, in many cases we can still use the normal distribution-based test if we only make sure that the size of our samples is large enough. The latter option is based on an extremely important principle that is largely responsible for the popularity of tests that are based on the normal function. Namely, as the sample size increases, the shape of the sampling distribution (i.e., distribution of a statistic from the sample; this term was first used by Fisher, 1928a) approaches normal shape, even if the distribution of the variable in question is not normal. This principle can be illustrated by a series of sampling distributions (created with gradually increasing sample sizes of 2, 5, 10, 15, and 30) using a variable that is clearly non-normal in the population, that is, the distribution of its values is clearly skewed.

As the sample size (of the samples used to create the sampling distribution of the mean) increases, the shape of the sampling distribution approaches normal. Note that for n=30, the shape of that distribution is “almost” perfectly normal. This principle is called the central limit theorem (this term was first used by Pólya, 1920; German, “Zentraler Grenzwertsatz”).
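The principle can be reproduced with a small simulation. The sketch below is not the original illustration: it assumes an exponential population (a clearly skewed variable chosen for convenience) and uses the skewness of the simulated sampling distribution as a rough measure of how far its shape is from normal (a normal distribution has a skewness of 0). The function names are hypothetical.

```python
import random
import statistics


def skewness(values):
    """Sample skewness: the mean cubed standardized deviation (0 if symmetric)."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum(((v - mean) / sd) ** 3 for v in values) / len(values)


def sampling_distribution_of_mean(n, replications=5000, seed=0):
    """Means of `replications` samples of size n from a skewed population.

    Assumption: the population is exponential with rate 1.
    """
    rng = random.Random(seed)
    return [
        statistics.fmean([rng.expovariate(1.0) for _ in range(n)])
        for _ in range(replications)
    ]


if __name__ == "__main__":
    for n in (2, 5, 10, 15, 30):
        means = sampling_distribution_of_mean(n)
        print(n, round(skewness(means), 2))
```

The printed skewness should shrink toward 0 as n grows, which is the numerical counterpart of the sampling distribution approaching the normal shape.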

How Do We Know the Consequences of Violating the Normality Assumption?

Although many of the statements made in the preceding paragraphs can be proven mathematically, some of them do not have theoretical proof and can be demonstrated only empirically, via so-called Monte-Carlo experiments. In these experiments, large numbers of samples are generated by a computer following predesigned specifications, and the results from such samples are analyzed using a variety of tests. This way we can empirically evaluate the type and magnitude of errors or biases to which we are exposed when certain theoretical assumptions of the tests we are using are not met by our data. Specifically, Monte-Carlo studies were used extensively with normal distribution-based tests to determine how sensitive they are to violations of the assumption of normal distribution of the analyzed variables in the population. The general conclusion from these studies is that the consequences of such violations are less severe than previously thought. Although these conclusions should not entirely discourage anyone from being concerned about the normality assumption, they have increased the overall popularity of the distribution-dependent statistical tests in all areas of research.
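A Monte Carlo experiment of this kind can be sketched in a few lines. The example below is an illustration under stated assumptions, not a reproduction of any published study: it draws both groups from the same population (so the null hypothesis is true), applies a pooled two-sample t test with 30 cases per group, and counts how often the test rejects at the .05 level. The critical value 2.0017 (two-sided, 58 degrees of freedom) is hard-coded to keep the sketch free of external libraries.

```python
import math
import random
import statistics

N_PER_GROUP = 30
# Two-sided 5% critical value of Student's t with 2 * 30 - 2 = 58 degrees of freedom.
T_CRITICAL = 2.0017


def pooled_t(a, b):
    """Two-sample t statistic with a pooled variance estimate."""
    na, nb = len(a), len(b)
    pooled_var = (
        (na - 1) * statistics.variance(a) + (nb - 1) * statistics.variance(b)
    ) / (na + nb - 2)
    return (statistics.fmean(a) - statistics.fmean(b)) / math.sqrt(
        pooled_var * (1.0 / na + 1.0 / nb)
    )


def false_positive_rate(draw, experiments=4000, seed=1):
    """Share of experiments in which a true null hypothesis is rejected.

    `draw` is a function taking a random.Random instance and returning one
    observation; both groups are always drawn from the same population.
    """
    rng = random.Random(seed)
    rejections = 0
    for _ in range(experiments):
        a = [draw(rng) for _ in range(N_PER_GROUP)]
        b = [draw(rng) for _ in range(N_PER_GROUP)]
        if abs(pooled_t(a, b)) > T_CRITICAL:
            rejections += 1
    return rejections / experiments


if __name__ == "__main__":
    print("normal population:", false_positive_rate(lambda rng: rng.gauss(0.0, 1.0)))
    print("skewed population:", false_positive_rate(lambda rng: rng.expovariate(1.0)))
```

If the test were badly affected by the violation, the rejection rate for the skewed population would move far from the nominal 5%; comparing the two printed rates gives a rough empirical reading of how sensitive the test is in this particular setup.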

Posted in Uncategorized

An introduction to Regression Analysis

May 22

Posted by statsoftsa

Posted in Statistica

Tags: regression analysis, south africa, statistica, statistics

STATISTICA for Academia & Research

May 21

Posted by statsoftsa

Since 1984, StatSoft has demonstrated a commitment to bridging the gap between technology and education. It is our mission to supply our customers with the most efficient and effective ways of learning.

StatSoft continues to build relationships where it matters most — in the classroom. We are committed to delivering software solutions and services for academia that will spark innovation and expand educational opportunities around the world.

STATISTICA is used by academia around the world.

STATISTICA for Professors and Students

Individual Professors and Students

Academic Solutions for Professors and Students are designed for individual use and are best suited to the desktop version of STATISTICA. To learn more about the desktop version of STATISTICA or to request a quote, please follow the links below.

STATISTICA Desktop

STATISTICA for Classrooms, Departments and Campuses

Entire Classrooms and Campus-Wide

StatSoft provides licensing agreements to universities that wish to license STATISTICA for an entire classroom or even campus-wide. To learn more about these solutions, please contact Academic Sales at 0112346148 or email us at info@statsoft.co.za.

WebSTATISTICA

To request a quote, please contact us at sales@statsoft.co.za or info@statsoft.co.za, or call 0112346148 for more information.

Posted in Uncategorized

New Poll: Analytics, Data Mining, Big Data Software Used? Poll #Analytics #kdnuggets #datamining #software

May 18

Posted by statsoftsa

Have you used STATISTICA in the past 12 months? Take this simple, one-question survey: http://brev.is/Xj63 #bigdata #analytics #kdnuggets
