# Monthly Archives: March 2013

## How to Use Breakdown Analysis for Non-Factorial Tables

*STATISTICA* can calculate descriptive statistics for dependent variables in each of a number of groups defined by one or more grouping (independent) variables. A breakdown analysis is normally used as an exploratory data analysis technique. The typical question that this technique can help answer is very simple: do the groups created by the independent variables differ with regard to the dependent variable?

There are two types of breakdown analyses available in *STATISTICA*: *Breakdown & one-way ANOVA* and *Breakdown; non-factorial tables*.

The *Breakdown & one-way ANOVA* analysis is used to compute various descriptive statistics, correlation matrices, summary graphs, etc., broken down by groups. This option also enables you to perform complete one-way ANOVAs, and provides tests of homogeneity of variance and post-hoc tests of mean differences.

The *Breakdown; non-factorial tables* analysis is used to compute various descriptive statistics broken down by groups identified by unique combinations of values on the breakdown variables. Unlike the *Breakdown & one-way ANOVA* analysis, the groups specified in the data do not need to define a full factorial table.

This example is based on a data set reported by Finn (1974). Four groups of 12 subjects each were asked to sort a list of 50 words (each printed on one card) into a specified number of categories. The experimental groups differed with regard to the instructions they received concerning the number of categories represented in the word lists and the actual number of categories in the word lists (as “built” into them by the experimenter).

The data set, *Mancova.sta*, is one of the example data sets included with *STATISTICA*. To open this spreadsheet, select the *File* tab. In the left pane, select *Open Examples*. In the *Open a STATISTICA Data File* dialog box, double-click on the *Datasets* folder, select the *Mancova.sta* file, and click the *Open* button.

Suppose we want to produce descriptive statistics of how many words were sorted, broken down by *GROUP* and *CATS*.

On the *Statistics* tab, in the *Base* group, click *Basic Statistics* to display the *Basic Statistics and Tables* Startup Panel. Select *Breakdown & one-way ANOVA*.

Click the *OK* button to display the *Statistics by Groups (Breakdown)* dialog box.

On the *Individual tables* tab, click the *Variables* button. In the variable selection dialog box, select *WORDS* as the dependent variable, and select *GROUP* and *CATS* as the grouping variables. Click the *OK* button.

In the *Statistics by Groups (Breakdown)* dialog box, click *OK*. *STATISTICA* will automatically choose all codes for the grouping variables, and the *Statistics by Groups – Results* dialog box will be displayed.

Note that there are numerous options on the *Descriptives* tab to add various other statistics to the results spreadsheet if desired. For this example, click the *Summary* button to produce the default descriptive statistics.

Note that this spreadsheet contains results for all group combinations because the analysis assumes a full factorial design. However, some of these results may not be appropriate to include. Since the data do not form a full factorial design, it is better to use *Breakdown; non-factorial tables* instead.

To do this, resume the analysis by pressing CTRL+R or by clicking the *Statistics by Groups* button on the analysis bar at the lower-left of the *STATISTICA* window. In the *Statistics by Groups – Results* dialog box, click *Cancel*, and in the *Statistics by Groups (Breakdown)* dialog box, click *Cancel* to return to the *Basic Statistics and Tables* Startup Panel.

Select *Breakdown; non-factorial tables*.

Click the *OK* button. In the *Statistics BreakDown (non-factorial)* dialog box, on the *Quick* tab, click the *Variables* button. In the variable selection dialog box, select *WORDS* as the dependent variable, and select *GROUP* and *CATS* as the grouping variables. Click *OK*.

Note that more statistics can be added to the results spreadsheet via the *Descriptives* tab. For this example, click the *Summary* button to produce the default descriptive statistics.

This spreadsheet contains the descriptive statistics for only the group combinations that actually occur in the data. Note that one combination does not have a standard deviation listed because its sample size is one.
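Conceptually, the non-factorial breakdown computes statistics only for the group combinations present in the data. As a rough illustration outside of *STATISTICA*, here is a small Python sketch (the rows are made up in the spirit of *Mancova.sta*, not the actual values):

```python
import statistics
from collections import defaultdict

# Made-up rows in the spirit of Mancova.sta: (GROUP, CATS, WORDS)
rows = [
    (1, 2, 40), (1, 2, 36), (1, 4, 42),
    (2, 4, 38), (2, 4, 45), (3, 2, 30),
]

# Collect WORDS for each (GROUP, CATS) combination that actually occurs;
# combinations absent from the data never appear (non-factorial tables)
groups = defaultdict(list)
for g, c, w in rows:
    groups[(g, c)].append(w)

for key in sorted(groups):
    vals = groups[key]
    # the standard deviation is undefined when a combination has n = 1
    sd = statistics.stdev(vals) if len(vals) > 1 else None
    print(key, len(vals), statistics.mean(vals), sd)
```

The combination with a single observation gets no standard deviation, just as in the results spreadsheet above.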

## North West University – Success Note

With a simple explanation in a LinkedIn forum, we were able to reduce Vincent Hwendero’s headaches! He wrote us a nice note: “STATISTICA just made my life easy in the application of regression control charts…Thank you STATISTICA.” Vincent is a Jr. Researcher/Assistant Lecturer at North West University.

## Data Mining with STATISTICA 35 Video Sessions

The Data Mining with STATISTICA series comprises 35 videos covering concepts, process, and hands-on data mining. The intent of this series is to familiarize you with data mining so that you are empowered to start your own data mining projects and carry them out successfully. You will see the tools that are available, the output they produce (with tips on interpreting it), and how to demonstrate project success.

Click here to go to the Data Mining Sessions YouTube Channel

## 10 Answers to Pi Trivia Questions

**QUESTIONS**

- What is the definition of pi (not its estimated numeric value)?
- What 18th Century English mathematician first employed the Greek letter Pi to symbolize the value of #1 above?
- Pi is which letter (first, second, third, etc.) of the Greek alphabet?
- Is it any coincidence that the first two letters of the word “pizza” are P and I? (yes or no)
- Is the value of pi a rational or irrational number?
- The Greek letter Pi is also used in legal shorthand to represent what?
- In what *Star Trek (TOS)* episode is an evil energy being tricked by Kirk’s crew into attempting a full calculation of pi?
- What is the general literary technique that places artificial constraints on an author’s use of words, e.g., limiting word lengths to mimic the sequential digits of pi? (Yes, this has actually been done.)
- In 2010-2011, how many days were needed to calculate the current record of 10 trillion decimal places for pi?
- What is the name given to the practice of memorizing large strings of digits of pi through the use of various mnemonic techniques?

**ANSWERS**

- pi = C/D, the ratio of a circle’s circumference to its diameter, a constant yet irrational value (oops, that’s the answer to #5)
- William Jones of Wales (1706), subsequently popularized by Swiss mathematician Leonhard Euler several decades later
- 16th letter of Greek alphabet
- Yes, total coincidence–but a tasty one!
- Irrational
- Plaintiff
- “Wolf in the Fold” (1967), perhaps popularly remembered as the “Jack the Ripper” episode
- Constrained writing, as exemplified by *Cadaeic Cadenza*
- 371 days
- Piphilology (a play on the words “pi” and “philology”)

## Using Transparency on Scatterplots to Display Point Density

Many times when using a scatterplot that contains a high density of points, it is difficult to fully understand the data since some points are obscured by other points. Furthermore, there are many cases where the density of points needs to be understood, but this cannot always be accomplished with normal scatterplot techniques. To address both of these problems, *STATISTICA* 10 makes it possible to control point transparency in a scatterplot. This example illustrates creating a scatterplot with transparent points in *STATISTICA*.

The data that I am using is one day’s worth of recorded observations from a power plant (more than 25,000 pairs of data). The variables measured are recorded so often that a scatterplot of the data is usually not useful since all patterns in the data are lost due to a high density of plot points. If we produce the default scatterplot for this data we get this:

Here we see that there seems to be no obvious relationship between our X and Y variables. The problem is that we can’t see the density of the points at each spot, so visually a single outlier has as much impact as a point that represents 100 observations. To try to more accurately display our data graphically, let’s use the new transparency feature.

The transparency slider that controls the plot point transparency is located in the lower-right corner of the scatterplot. I changed the transparency of the points to about 80%.

Now our plot is displaying data in a way that more information can be gained from it. We see that the majority of the points (where the plot is the darkest) falls inside a “bowtie” shape bounded by +/- 250 on the Y axis. This pattern enables us to have a much deeper understanding about the relationship between X and Y. We could not see this very distinctive pattern until we applied point transparency in our scatterplot.
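For readers working outside *STATISTICA*, the same technique can be sketched in Python with matplotlib, where the `alpha` parameter controls point opacity (the data below are simulated, not the power plant measurements):

```python
import random

import matplotlib
matplotlib.use("Agg")  # render off-screen, no display needed
import matplotlib.pyplot as plt

random.seed(1)
# Simulate a dense data set (the real example used >25,000 pairs)
x = [random.gauss(0, 1) for _ in range(25000)]
y = [random.gauss(0, 150) for _ in range(25000)]

fig, ax = plt.subplots()
# alpha=0.2 means each point is 80% transparent, so regions where many
# points overlap render darker and reveal the density structure
pts = ax.scatter(x, y, s=5, alpha=0.2)
fig.savefig("density_scatter.png")
```

With opaque points this plot would be a solid blob; with `alpha=0.2`, darker regions mark where observations concentrate.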

Point transparency is just one of many improvements to graphical displays in *STATISTICA* 10. We can see from this example that these improvements don’t just produce better looking graphics; they also present new ways to use graphics to glean information from complicated data sets.

-Shannon L. Dick

## 10 Questions About Pi

Aah, tomorrow is Pi Day, which has nothing to do with **pi**zza **pi**e other than sharing some common letters. But that didn’t stop me from mocking up a tasty **pi**zza **pi**e image for this blog post, because it’s lunch time right now as I’m typing this and, frankly, I am hungry.

Also, Pi Day annually sends a thrill up the leg of many a statistician all over the world, and celebrations often include the eating of such homophonic foods. So, some kind of *PI*E visual seemed appropriate here. It was going to be either pizza or an image of the Greek alphabet character. And pizza just looked better.

So now, try these ten trivia questions about Pi…

- What is the definition of pi (not the estimated numeric value)?
- What 18th Century English mathematician first employed the Greek letter Pi to symbolize the value of #1 above?
- Pi is which letter (first, second, third, etc.) of the Greek alphabet?
- Is it any coincidence that the first two letters of the word “pizza” are P and I? (yes or no)
- Is the value of pi a rational or irrational number?
- The Greek letter Pi is also used in legal shorthand to represent what?
- In what *Star Trek (TOS)* episode is an evil energy being tricked by Kirk’s crew into attempting a full calculation of pi?
- What is the general literary technique that places artificial constraints on an author’s use of words, e.g., limiting word lengths to mimic the sequential digits of pi? (Yes, this has actually been done.)
- In 2010-2011, how many days were needed to calculate the current record of 10 trillion decimal places for pi?
- What is the name given to the practice of memorizing large strings of digits of pi through the use of various mnemonic techniques?

Hopefully, I shall remember to post the answers later.

## Visualize Data with Color Coding

Want to quickly see (without reading text) if a particular row of data is *bad*? You can color code the data: visualize it with conditional formatting.

Start by downloading ColorBackground.zip. After you unzip, you will see one file, ColorBackground.svb.

To see the value of color coding, start *STATISTICA* and open the downloaded macro. Now open the example dataset, “Cat Clinic.sta”.

If you don’t know how to open example datasets, start by selecting the *File* menu -> *Open Examples*. Now open the *Datasets* folder. You will see hundreds of *STATISTICA* spreadsheets. Select “Cat Clinic”.

We are going to see which cats are overweight with a standard BMI (body mass index) formula.

BMI = (weight / (height x height)) x 703

**NOTE**: We know there is a different BMI formula for cats, but our dataset does not include the rib cage measurement, so we are using the human BMI formula.

With the dataset and macro open, and the macro in focus, press F5 to execute the macro.

Variable 3 (V3) has the weight. Variable 4 (V4) has the cat’s height. So the calculation for an overweight cat is:

(V3/(V4*V4)) * 703 > 24

Click OK. The data is now color coded. Overweight cats are red.
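For comparison, the macro’s flagging condition can be sketched in plain Python (the cat names and measurements here are hypothetical; the real macro operates on spreadsheet variables V3 and V4):

```python
# Flag overweight cats with the macro's condition: (V3/(V4*V4)) * 703 > 24,
# where V3 is weight and V4 is height. Names and values are made up.
cats = [
    {"name": "Whiskers", "weight": 12.0, "height": 18.0},
    {"name": "Tiny",     "weight": 7.0,  "height": 17.0},
]

def bmi(weight, height):
    # human BMI formula, as used in the post
    return (weight / (height * height)) * 703

overweight = [c["name"] for c in cats if bmi(c["weight"], c["height"]) > 24]
print(overweight)
```

The macro does the same test per row, then colors the rows where the condition holds.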

Image Credit: Color Coded Bookshelf

http://www.flickr.com/photos/juhansonin/3254322054/

## StatSoft VP to Present Keynote at Big Data Business Conference

StatSoft, Inc. announces that its Vice President of Analytic Solutions, Dr. Thomas Hill, is scheduled to present the keynote address at “The Business of Big Data” conference in Boston this week.

A noted speaker and author, Hill will speak on “The Nature of Insight from Big and Complex Data,” from the viewpoint of human cognitive science as it relates to strategies and approaches in predictive modeling.

The conference will be held at Boston Harbor Hotel by Knowledgent, a USA-based industry information consultancy. Organizers expect the conference to attract C-level decision-makers from financial and other industries primarily interested in the business aspects of big data, rather than just the technology aspects.

Dr. Hill’s keynote will focus on the role of human cognition as an efficient decisioning mechanism when applied against high-velocity big data. Hill posits that some underlying processes that give rise to expertise through interaction with complex and very rich data streams are, in fact, well understood, and that these processes can guide how best to approach predictive and other modeling analyses of big and high-velocity data relevant for key business performance indicators (KPIs).

As Vice President for Analytic Solutions, Dr. Hill has been involved for over twenty years in the development of data analysis and data mining algorithms, and the delivery of analytic solutions. Dr. Hill also taught data analysis and data mining courses at The University of Tulsa for over ten years. He has received numerous grants and awards from the National Science Foundation, the National Institute of Health, the Center for Innovation Management, Electric Power Research Institute, and other organizations. Hill is a frequent speaker at national and international conferences. He has completed diverse consulting projects with companies from practically all industries, and has worked with leading insurance, financial services, and other companies to identify and refine effective applications for predictive analytic solutions.

He is the author (with Paul Lewicki, 2005) of *Statistics: Methods and Applications* as well as the *Electronic Statistics Textbook* (a popular online resource on statistics and data mining). He has published widely on innovative applications of data mining and predictive analytics, contributed numerous tutorials to the *Handbook of Statistical Analysis and Data Mining Applications* (Elsevier/Academic Press, 2009), and is a co-author of *Practical Text Mining And Statistical Analysis for Non-Structured Text Data Applications* (Elsevier/Academic Press, 2012). Both books won the prestigious PROSE Award from the Association of American Publishers for “pioneering works of research…and design of landmark works in their fields.”

Dr. Hill is also a contributing author to the forthcoming book, *Practical Predictive Analytics and Decisioning Systems for Medicine*, to be published by Academic Press early in 2014.

## Business Intelligence – Solve a Critical Quality Problem

This is a continuation of Predictive Analytics – Solve a Critical Quality Problem. A biopharmaceutical manufacturing company was scrapping about 30% of its batches, which is very expensive. The company’s engineers tried to solve the problem with various techniques.

But it was not until they started using predictive analytics (also known as data mining) that they uncovered actionable process improvements. These improvements are predicted to lower the scrap rate from around 30% to 5%.

How were these improvements discovered?

**The Data Mining Approach for Root Cause Analysis:** Data mining is a broad term used in a variety of ways, in addition to other terms such as “predictive modeling” or “advanced analytics.”

Here, it means the application of the latest data-driven analytics to build models of a phenomenon, such as a manufacturing process, based on historical data. In a nutshell, in the last 10-15 years, there has been a great leap forward in terms of the flexibility and ease of building models and the amount of data that can be utilized efficiently due to advances in computing hardware.

Data mining has changed the world of analytics

… in a good way.

… forever.

Companies that embrace these changes and learn to apply them will benefit.

Data mining begins with the definition and aggregation of the relevant data. In this case, it was the last 12 months of all the data from the manufacturing process, including:

- raw materials characteristics
- process parameters across the unit operation for each batch
- product quality outcomes on the critical-to-quality responses on which they based their judgment about whether to release the batch or scrap it

Once the relevant data were gathered, StatSoft consultants sat down with the engineering team before we began the model building process. This is a **critical step** and one that you should consider as you adopt data mining.

We asked the engineers questions such as:

- Which factors can you control, and which ones can you not control?
- Which factors are easy to control, and which ones are difficult or expensive to control?

The rationale is that data mining is not an academic exercise when applied to manufacturing. It is being done to improve the process, and that requires action as the end result. A model that is accurate but based solely on parameters that are impossible or expensive to tweak is impractical (which is a nice way of saying **useless**).

Empowered with this information, model building is the next step in the data mining process. In short, many data mining model types are applied to the data to determine which one results in the optimal goodness of fit, such as the smallest residuals between predicted and actual values.

Various methods are employed to ensure that the best models are selected. For example, a random hold-out sample of the historical data is set aside, and each model is asked to make predictions for it. This helps protect against overfitting, where a model becomes so good at reproducing one set of historical data that it is really bad at predicting the outcomes for other batches.
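A minimal sketch of such a random hold-out split, using plain Python and hypothetical batch IDs:

```python
import random

random.seed(0)  # reproducible illustration
batches = list(range(100))  # hypothetical batch IDs
random.shuffle(batches)

# Hold out 25% of the historical batches. A model is fit only on
# `train`; its error on `holdout` estimates how well it will predict
# batches it has never seen.
cut = int(len(batches) * 0.75)
train, holdout = batches[:cut], batches[cut:]
```

Dedicated tools typically automate this (and variants such as cross-validation), but the principle is just this partition.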

A major advantage of data mining is that you don’t need to make assumptions ahead of time about the nature of the data and the nature of the relationships between the predictors and the responses. Traditional least squares linear modeling, such as what is taught in Six Sigma classes on the analytic tools, does require this knowledge.

For root cause analysis, most data mining techniques provide importance plots or similar ways to see very quickly which raw materials and process parameters are the major predictors of the outcomes, and, just as valuable, which factors don’t matter.
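As a rough illustration of the idea behind importance ranking (not the specific algorithm any particular data mining tool uses), here is a Python sketch that ranks two made-up process parameters by the absolute Pearson correlation with a quality response:

```python
from math import sqrt

# Toy history: two process parameters and one quality response.
# All values are made up; `temp` drives quality, `humidity` is noise.
temp     = [1, 2, 3, 4, 5, 6, 7, 8]
humidity = [5, 1, 4, 2, 6, 3, 8, 7]
quality  = [2, 4, 6, 8, 10, 12, 14, 16]

def pearson(x, y):
    """Pearson correlation coefficient, computed from scratch."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# |r| as a crude single-factor importance score
scores = {"temp": abs(pearson(temp, quality)),
          "humidity": abs(pearson(humidity, quality))}
ranked = sorted(scores, key=scores.get, reverse=True)
```

Real importance measures from tree ensembles and similar models also capture interactions, which this single-factor score cannot.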

At this point in the data mining process, StatSoft consultants sat down with the engineering team to review the most important parameters. Typically, there is an active discussion with comments from the engineers such as:

- that can’t be
- I don’t see how that parameter would be relevant

The conversation gradually transforms over the course of an hour to:

- I could see how those parameters could interact with the ones later in the process to impact the product quality

Data mining methods are really good at modeling large amounts of data from lots of parameters, a typical situation in manufacturing. Humans are good at thinking about a few factors at a time and interpreting a limited time window of data.

As shown above, the two approaches complement each other, with the results from data mining as important insights about the manufacturing process that can then be evaluated, validated, and utilized by the engineering team to determine:

- Now, what do we do to improve the process? What are the priorities?

The company then planned to implement process improvements that are predicted to lower the scrap rate of batches from ~30% to ~5%!

Note: To get from root cause analysis to process improvements, the models were used for optimization (another data mining technique).

Next blog: Considerations for the Application of Data Mining.

*Article was first published in **Z Consulting**’s Elevate Manufacturing newsletter for January 2011.*

## How to Use the Statistical Advisor

If you are not certain which particular procedure within *STATISTICA* you should use for a specific problem, consult the *Statistical Advisor*.

The *Statistical Advisor* facility is built into the *STATISTICA* Electronic Manual. It presents a series of questions about the nature of the research problem and the type of data.

For each question, click on the appropriate answer to proceed to the next question or suggestion. If you are unsure of the answer to a question, click on the *Get More Info* link to read more about that particular topic.

Based on your answers to the successive questions about the nature of your research, the *Statistical Advisor* suggests which statistical methods could be used and where to find them in *STATISTICA*.

To access the *Statistical Advisor* from the ribbon bar, select the *Help* tab. In the *Help* group, click *Statistical Advisor*.

To access the *Statistical Advisor* from the classic toolbar, select *Statistical Advisor* from the *Help* menu.

Additionally, you can explore examples of commonly used statistical tests. Click the *commonly used statistical tests* link on the first page of the *Statistical Advisor* to access this topic.