Monthly Archives: October 2012

Creating an Interactive Analysis Using STATISTICA Enterprise

STATISTICA Enterprise can be used for automated analyses and reports as well as interactive analyses. One of the main strengths of the STATISTICA Enterprise tool is its analysis templates, called analysis configurations, which automate and streamline various analysis tasks. Often, however, an analysis should be performed interactively, with a person guiding the project, as opposed to producing automated results. When this is the case, Enterprise users can take advantage of existing data configurations to perform these interactive analyses with an ad hoc analysis configuration.

Standard methods of obtaining data include opening a STATISTICA spreadsheet, importing data from a source such as an Excel or text file, or querying a database. STATISTICA Enterprise users have an additional option, the ad hoc analysis configuration. This option automates the query of the data by using a data configuration, saving time and ensuring consistency. Bringing in the right data for analysis may involve a complex query that, for example, joins multiple tables or uses specific selection criteria. The ad hoc analysis configuration is a very easy way to take advantage of the work already performed for accessing the needed data.
What is an ad hoc analysis configuration? This is an analysis configuration in STATISTICA Enterprise that simply pulls in the data needed for analysis using a data configuration. Then, the analyst can work in STATISTICA interactively to perform the needed tasks.
How can I create an ad hoc analysis? First, the ad hoc analysis needs a data configuration that queries the needed data. Either create a new data configuration or use an existing one. (See an example of setting up a data configuration.)
After the data configuration is created, right-click on the folder containing the data configuration in the left pane of STATISTICA Enterprise Manager, and from the menu select New Analysis Configuration.

STATISTICA Enterprise Manager

Depending on the other uses of the data configurations, a dialog box may be displayed that contains the message “The Data Configuration does not have any characteristics. Do you want to continue?”

STATISTICA Enterprise Manager no characteristics

This message occurs because many analysis configurations are used to produce quality control charts where variable characteristics are required. This is not a requirement of ad hoc analysis. Click the Yes button to continue.
Next, permissions should be set for the analysis being created. A dialog box prompts you to set permissions, giving you the option to use the same permissions as the data configuration. Choose the appropriate option for your needs.

STATISTICA Enterprise permissions dialog box

Next, the new analysis configuration is created and added to the tree panel in STATISTICA Enterprise Manager. Change the analysis Type to Ad hoc Analysis.

STATISTICA Enterprise Manager select ad hoc analysis

Commit the change and the ad hoc analysis configuration is created. It is now ready for use in STATISTICA or STATISTICA Enterprise Server.
How can I use the ad hoc analysis? In STATISTICA, select the Enterprise tab. In the Enterprise group, click Run Analysis/Report. In the Run Analysis or Report dialog box, select the newly created ad hoc analysis.

STATISTICA Run Analysis or Report dialog box

Click the OK button.
In STATISTICA Enterprise Server, there is a similar option. From the Enterprise menu, select Component manager to display the Enterprise dialog box.

STATISTICA component manager dialog box

Click Run Analysis or Report, and browse to the newly created ad hoc analysis configuration. Select the ad hoc analysis, and click the Run Analysis or Report button.

STATISTICA web run analysis dialog box

In either tool, desktop STATISTICA or STATISTICA Enterprise Server, after running the analysis, the data are brought into a spreadsheet and ready for analysis.
For more information email lorraine@statsoft.co.za

How Capable is Your Process?

Written by: Todd Ellingson

When monitoring a process, it’s critical to know if that process is capable of meeting the required specifications. If process variability is high compared to the range of your customers’ specifications, then you will end up with lots of scrap. That’s bad.

But what can we do?
I’m glad you asked. It turns out there’s a statistic for this exact situation. It’s called the Process Capability statistic, abbreviated Cp. (Don’t ask me why it’s Cp and not PC; I guess dyslexia has been around a long time.)
Cp is the width of the specification limits divided by the width of the process, i.e., divided by the process variability, conventionally taken as six standard deviations. The larger Cp is, the more capable the process is of meeting the specification requirements. The XBar & R chart above shows a Cp statistic of 1.022 in the box at the upper-left.
It’s desirable to have a Cp value greater than one. This indicates the width of the specification limits (the numerator) is wider than the width of the process variability (the denominator). In other words, the process spread fits within the specification limits, provided the process is centered between them.
It’s interesting to know the history of how Cp is calculated. Calculating Cp requires calculating a standard deviation as an intermediate step. But back in the day, before computers were widespread, it was fairly time consuming to calculate something like a standard deviation.
So 50 years ago this was all a hassle. Then some smart people figured out you could estimate the standard deviation using a function of the range, i.e., the maximum minus the minimum. And the range is easy to calculate. As a result, the Cp statistic was almost always calculated using a function of the range to estimate the standard deviation.
Today computers are everywhere, and it’s easy to calculate the standard deviation directly. But old habits die hard, and the traditional range method is still the most common way to calculate Cp.
With STATISTICA, you can calculate Cp either way (using the range method, or calculating the standard deviation directly). But in keeping with convention, the range method is used by default, as is done in most (probably all) other statistical software.
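To make the two approaches concrete, here is a minimal Python sketch (not STATISTICA code) that computes Cp both ways. It assumes the usual definition Cp = (USL - LSL) / (6 × sigma), estimates sigma for the range method as the average subgroup range divided by the standard control-chart constant d2, and uses the overall sample standard deviation for the direct method. The data, the specification limits, and the function names are made up for illustration, and STATISTICA's own options may differ in detail.

```python
from statistics import mean, stdev

# d2 constants for subgroup sizes 2-5 (standard control-chart table values)
D2 = {2: 1.128, 3: 1.693, 4: 2.059, 5: 2.326}


def cp_direct(subgroups, lsl, usl):
    """Cp with sigma taken as the sample standard deviation of all values."""
    values = [x for group in subgroups for x in group]
    return (usl - lsl) / (6 * stdev(values))


def cp_range(subgroups, lsl, usl):
    """Cp with sigma estimated as R-bar / d2 (the traditional range method)."""
    n = len(subgroups[0])  # assumes every subgroup has the same size
    r_bar = mean(max(g) - min(g) for g in subgroups)
    sigma_hat = r_bar / D2[n]
    return (usl - lsl) / (6 * sigma_hat)


# Hypothetical data: four subgroups of five measurements each
subgroups = [
    [10.1, 9.9, 10.0, 10.2, 9.8],
    [10.0, 10.3, 9.7, 10.1, 9.9],
    [9.8, 10.0, 10.2, 10.1, 9.9],
    [10.2, 9.9, 10.0, 9.8, 10.1],
]
LSL, USL = 9.4, 10.6  # hypothetical specification limits

print("Cp (direct):", round(cp_direct(subgroups, LSL, USL), 3))
print("Cp (range): ", round(cp_range(subgroups, LSL, USL), 3))
```

The two estimates usually land close to each other but are rarely identical, which is why it matters to know which method a reported Cp value used.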

Teaching Macros to Play Nice with the other Data Sets

Written by: Jennifer Thompson

Macros and automation can save so much time, ulcers, gray hair, etc. In STATISTICA, creating macros to automate tasks is as easy as hitting record on your DVR. They are fast, and they make sure the analysis is done consistently with the same options and analysis procedures. One drawback, when comparing macros to people, is that macros don’t think for themselves. They just run the script, even if the analysis they are performing is absurd. So they may need to be taught to play nice with their data sets.

Recorded macros reference the variable’s position in the spreadsheet. The macro can be run on this data set as often as needed, always giving the proper results. If a new variable is added or one is deleted, the position of the variables for analysis may change.  A new data set may also organize the variables differently. If this is the case, running the macro will no longer give the expected results. Where the macro was supposed to build a regression model between X and Y, it is now treating X as dependent and using ID number as the independent variable. Chaos ensues.
What if there was an easy way to customize the macro to protect against this potential problem? I’ll give you two. The easy way is to simply remove the variable reference; then you select variables each time the macro is run. The more involved method takes a few steps to have the macro reference the variables by name, but the resulting macro runs without your input.

Method 1: Delete Variable Reference

Delete the variable selection portion of the macro. Variable selection in recorded macros behaves basically the same way regardless of analysis or graphing procedures used. The macro selects variables similarly to how it’s done in this example. The variable selection line is oAD2.Variables = "3-5". Variables 3 through 5 will be used for this descriptive statistics analysis.

Macro 1

By deleting 3-5, leaving empty quotes, the variable position reference is removed. Now running the macro will prompt you to select appropriate variables. As is typical of life, taking the easy option now means more work in the future. Each time you run this macro, you will need to select variables for analysis. The plus side is that the macro is compatible with any data set now.

Macro - Select Variables

Method 2: Customizing Macro to Reference by Name

The alternative is to add some custom, yet simple, programming to reference the variables by name. Then variables can be added, deleted, rearranged, etc., and the macro will still use the proper variables. The macro can be used on a new spreadsheet with the same variable names with no issues.
I will not claim that my way is the only way that works, or that it is the best way. (It may be the best way.) If nothing else, this example can spark ideas for improving your custom macros in STATISTICA!
I started with a loop to access the variable names and pick out the three variable names of interest. You may recognize them from an example data set, Adstudy.sta. They are MEASURE01, MEASURE02, and MEASURE03. Then I modified the variable selection line to use references based on variable names. The steps are listed below.
  1. Reference the spreadsheet, S1.
  2. Create placeholders, v1, v2 and v3, for the 3 variable positions.
  3. Create an array, VarList, for storing the variable names.
  4. In a loop, find the spreadsheet position of each of the variables and store them in the placeholders created in step 2.

Macro - Add Loop

  5. Modify the recorded macro, variable selection line, using the variable position variables.
  6. Delete any unnecessary lines of recorded code for simplicity.*

Macro - Modify Recorded section

Now running the macro will select the variables named MEASURE01, MEASURE02, and MEASURE03, regardless of their position in the spreadsheet. The macro is ready to play nice with other data sets.
 *Note: Recorded macros list all available options and settings. This makes modifying the macro much easier later on. It is easier to change an option from “True” to “False” than to come up with the line of code to access that option. For the purpose of showing others how a macro works, it helps to clean up the code, removing unnecessary parts. So I removed the lines of the recorded macro that simply set the default settings. The macro works the same with or without these lines.
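The name-lookup logic from steps 1 through 4 can be sketched outside STATISTICA as well. The Python snippet below is only an illustration of the idea, not the macro itself: the column layout is hypothetical, the function name is made up, and how the resulting positions are fed back into the recorded variable selection line depends on the macro.

```python
def find_positions(column_names, wanted):
    """Return the 1-based position of each wanted name, in the order requested."""
    positions = {}
    # Loop over the columns once, remembering where each wanted name sits
    for position, name in enumerate(column_names, start=1):
        if name in wanted:
            positions[name] = position
    missing = [name for name in wanted if name not in positions]
    if missing:
        raise KeyError(f"variables not found: {missing}")
    return [positions[name] for name in wanted]


# Hypothetical spreadsheet layout; the variables of interest are out of order
columns = ["ID", "GENDER", "MEASURE03", "ADVERT", "MEASURE01", "MEASURE02"]
var_list = ["MEASURE01", "MEASURE02", "MEASURE03"]

v1, v2, v3 = find_positions(columns, var_list)
print(v1, v2, v3)
```

Raising an error when a name is missing is a deliberate choice in this sketch: a macro that stops with a clear message is easier to deal with than one that quietly analyzes the wrong columns.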

World Gold Council publishes conflict-free gold standard

Picture by: Duane Daws

GOLD

The World Gold Council (WGC) on Thursday published the conflict-free gold standard, which aims to curb gold production fuelling conflict and human rights violations. The standard, which would apply to conflict-affected areas globally, was developed in collaboration with the council’s member companies, which comprise the world’s leading gold producers.

Business Intelligence Career, STATISTICA Enterprise

Written by: Angela Waner

I define “business intelligence” (BI) as transforming data into actionable information with computer-based tools. I did not realize it until much later, but BI was my first job out of college. And in many ways, BI is my job now. I work every day with my company’s business intelligence solution, STATISTICA Enterprise.

So back to my first job…I was hired as a software developer. Because I was the newest employee, I inherited a thankless task that no one wanted to do. I became the “report guru” and I quickly learned the mainframe language Easytrieve Plus. This language was actually created so that analyses and reports could be quickly generated on mainframes.

I was in charge of 50 scheduled analyses/reports. About 10 ran once a day. About 20 reports were generated every Monday. And the remaining reports were generated once a month. It was my job to make sure the analyses/reports were executed as soon as the data was available.

I also had to read and understand every report. I “validated” the analysis results as being reasonable. If I saw anything unusual, I had to investigate and fix it before I turned the reports over to management.

Every morning I had to summarize the 10 daily reports. I created a “dashboard” of KPIs (key performance indicators). Excel was my best friend.

(I know that some people will not see my activities as BI, but my work met the spirit of the definition. I was using a mainframe and Excel. These are computer-based tools. And I tried to automate as many tasks as possible.)

Occasionally I would see changes ripple through the company from the analyses, reports, and dashboards that I created. I felt powerful. I felt useful.

But many times management would respond to the “dashboard” by asking for an ad-hoc analysis that sliced the data differently. Or they would ask me for my interpretation of the KPIs. And they would want this information yesterday. I felt stressed.

Data moved too slowly into actionable information.

My first employer was focused on answering questions like:

What happened?

Why did it happen?

But management really wanted answers to questions like:

Why is it happening right now? (monitoring)

What might happen? (predicting)

I did not have the ability to provide this.

Eventually I left programming, became a parent, and changed careers. I have been a project manager at StatSoft since 2005. When I started my employment at StatSoft, I left the “analyze my data with Excel” environment. I joined an environment with enterprise analytics (templates, reporting, monitoring, and dashboards) and predictive analytics.

It has been an interesting adventure. I learned how to use data mining software. I learned how to create templates for my analyses and reports. And I plan on learning more about text mining. I feel empowered.

Independent Report Labels StatSoft as “Double Victor” Based on Vision, Viability, and Value

StatSoft Recognized by Analysts as Industry Leader in Predictive Analytics
Hurwitz’s Victory Index Labels StatSoft as “Double Victor”
Organizations are adopting, integrating and utilizing predictive analytics at an incredible rate. The business value of predictive analytics is clear: it enables organizations to define and attract the most profitable customers, streamline their resourcing and supply chain, improve the quality and targeting of their products, and many other applications. The Victory Index for predictive analytics, developed by Hurwitz & Associates, is designed to help organizations with an analysis of vendors and solutions for predictive analytics software. Hurwitz labeled StatSoft as a “Double Victor” based on its strong presence in the market, a solid vision, impeccable customer service, and great value for lower total cost of ownership.
The Victory Index is a valuable tool that companies can use to better understand predictive analytics and how they can become key players in a highly competitive market. The report shows where each of the leading vendors falls within the designated categories so that companies can capitalize on the experts and their Index rankings in the field of predictive analytics.

 


Data Mining with STATISTICA – Session 1

Joining Data Tables in STATISTICA

Written by: Jennifer Thompson

When data come from multiple sources, such as database tables, it can become necessary and beneficial to join or merge those tables to get the maximum information from our data. In this article, I will look at ways to bring data sources together easily in STATISTICA.

Specifically, we will show how inner and outer joins in queries can achieve this goal. Then we will show how the merge tool in STATISTICA can do the same tasks, both via interactive dialog boxes and the workspace.

Joins in Queried Data

When data reside in databases, a query is needed to bring the data into STATISTICA for analysis. During the query, functions such as an inner join or outer join can bring the data together. When joining two or more tables, a reference field from each table is needed. An inner join returns only the records where a match was found on this reference field in both tables; records without a match in the other table are discarded. (Inner joins can be built in STATISTICA with the GUI Query Builder tool.) An outer join also returns records without a match in the joining table. Which unmatched records are kept depends on the type of outer join used: left, right, or full. (Outer joins in STATISTICA queries can be performed in Text Mode in the query tool.) See the simple example below:

sample data tables

Inner join results are shown below. Only complete records are returned, that is, records found in both tables.

ID   Data   First Name   Last Name
1    65     Sally        Smith
5    45     Joe          Jones

 

This join was built with the GUI STATISTICA Query tool as seen here.

Query inner join

Full outer join results are shown below. All records are returned from both tables.

ID   Data   First Name   Last Name
1    65     Sally        Smith
2    35
4    86
5    45     Joe          Jones
6           Pete         Adams
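For readers who want to see the join logic spelled out, here is a minimal Python sketch of an inner join and a full outer join on the ID field. It is an illustration only, not the STATISTICA query: the two input tables are reconstructed from the result tables in this article, and the function names are made up.

```python
# Input tables reconstructed from the join results shown in this article
data_table = {1: 65, 2: 35, 4: 86, 5: 45}  # ID -> Data
name_table = {  # ID -> (First Name, Last Name)
    1: ("Sally", "Smith"),
    5: ("Joe", "Jones"),
    6: ("Pete", "Adams"),
}


def inner_join(left, right):
    """Keep only the IDs that appear in both tables."""
    shared = sorted(left.keys() & right.keys())
    return [(key, left[key], *right[key]) for key in shared]


def full_outer_join(left, right):
    """Keep every ID from either table; missing fields become None."""
    rows = []
    for key in sorted(left.keys() | right.keys()):
        data = left.get(key)
        first, last = right.get(key, (None, None))
        rows.append((key, data, first, last))
    return rows


for row in inner_join(data_table, name_table):
    print("inner:", row)
for row in full_outer_join(data_table, name_table):
    print("outer:", row)
```

The None values play the same role as the blank cells in the full outer join table.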

 

These results were found with this query statement, seen below:

Merging Data Interactively

The same concepts can be used with data already found in STATISTICA spreadsheets. The Merge tool found on the Data tab in the Manage group can do this as well. In the Merge Options dialog box, the two data spreadsheets are selected with the File 1 and File 2 buttons. Then, change the Mode to Match variables. For an inner join style of merge, select the Unmatched Cases option, Delete cases. For an outer join style of merge, use the default Unmatched Cases option, Fill with MD.

Merge interactively

Merging Data in the Workspace

This merge task can also be performed in the STATISTICA Workspace. First, both data tables should be inserted into the workspace by clicking Data Source. The column to be used for joining the data tables should be selected as the Dependent, continuous variable in each data source.

Then, using the Node Browser, select the Comparing and Merging Multiple Data Sources folder to find the Merge Variables node.

node browser

Next, we need to edit the parameters of the Merge Variables node. Double-click the node to display the Edit Parameters dialog box. Change the Mode to Relational. For an inner join style of merge, set Unmatched cases to Delete; for an outer join style, set it to Fill with MD.

When the selections are made, run the workspace to create the joined spreadsheet.

These operations are essential to creating the needed tables for analysis. With STATISTICA, you have several paths to choose from for meeting the end goal.

 

How to Find Confidence Intervals for a Single Proportion

In statistics, sample data is often used to help find estimates of population parameters. Common parameters that experimenters try to estimate include population means, standard deviations, and proportions. Estimates called confidence intervals are used to estimate these parameters.

 

What Is a Confidence Interval?

The sample statistics (or point estimates) – such as the mean, standard deviation, proportion, etc. – are used to make inferences about a population based on a random sample from that population. The point estimate likely does not equal the population parameter it estimates, but should be close. The confidence interval is a range around the point estimate, constructed by a procedure that captures the population parameter with a specified probability over repeated sampling, typically 0.95 for a 95% confidence interval. The confidence interval gives a better estimate of the population parameter of interest because it conveys the range in which the population parameter plausibly lies.

Confidence Intervals for Single Means and Standard Deviations in STATISTICA

In STATISTICA, you can use the Descriptive Statistics analysis available via the Basic Statistics module to find confidence intervals for a single mean or single standard deviation. To access this analysis, first open a data file, and then select the Statistics tab. In the Base group, click Basic Statistics.

In the Basic Statistics and Tables Startup Panel, select Descriptive Statistics and click OK to display the Descriptive Statistics dialog box. The options for the confidence intervals for the mean and standard deviation are on the Advanced tab. You can specify the confidence level for each via the respective Interval edit box.

Statistica interval edit box

You would then click the Summary button to get the requested statistics, which would include these confidence intervals.

Using STATISTICA to Find a Confidence Interval for a Single Proportion

The Descriptive Statistics analysis is useful for finding statistics regarding continuous data. Proportions, however, are based on counts rather than continuous measurements. Tools such as Frequency Tables and Tables and Banners can find proportions. You can find a confidence interval for a single proportion using the Power Analysis module. This module is often used to calculate statistical power for a given analysis or to calculate the sample size required to attain a certain power level for a given analysis, but it can also be used to calculate, for a given analysis type, specialized confidence intervals not generally available in general-purpose statistical packages.

Confidence Interval for a Single Proportion Example

In this example, researchers took a sample of 500 randomly selected subjects who completed four years of college. They found that 75 of them smoked on a regular basis. Thus, the sample proportion (often designated as p̂) of people who smoked and had a four-year college education is 75/500=0.15 (or 15%). If we wanted an estimate of the true proportion (usually designated as p) of people who smoke that have a four-year education, we could construct a confidence interval for the proportion.

The simplest and most commonly used formula for this type of confidence interval relies on approximating the binomial distribution with a normal distribution (the proportion is binomial because the person sampled either smoked or did not smoke). The formula is:

p̂ - z₁-α⁄2 * sqrt[p̂(1-p̂)/n] < p < p̂ + z₁-α⁄2 * sqrt[p̂(1-p̂)/n]

where p̂ is the sample proportion, n is the sample size, and z₁-α⁄2 is the 1-α⁄2 percentile of the standard normal distribution; α is the Type I error rate and is the complement of the confidence level. Thus, for a 95% confidence level, the error α is 5% or 0.05.

This z-score can be calculated within STATISTICA. On the Statistics tab in the Base group, click Basic Statistics to display the Basic Statistics and Tables Startup Panel. Select Probability calculator.

Statistica probability calculator

Click OK to display the Probability Distribution Calculator.

In the Distribution field, select Z (Normal). Select the Inverse, Two-tailed, and (1-cumulative p) check boxes. We are using α = 0.05, so enter this value for p. Click the Compute button to calculate the z critical value (which is given in the X edit field). It is found to be 1.959964, which is commonly rounded to 1.96.

Statistica probability distribution calculator

Thus, the confidence interval for the true proportion is 0.15-1.96*sqrt[(0.15)(0.85)/500] < p < 0.15+1.96*sqrt[(0.15)(0.85)/500], which gives 0.11870131 < p < 0.18129869.
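The same hand calculation can be checked with a few lines of Python. This is a sketch outside STATISTICA that uses only the standard library; the function name is made up for illustration. It computes the z critical value from the normal distribution instead of rounding it to 1.96, so the limits differ from the hand-calculated ones in the later decimal places.

```python
from math import sqrt
from statistics import NormalDist


def wald_interval(successes, n, conf_level=0.95):
    """Normal-approximation confidence interval for a single proportion."""
    p_hat = successes / n
    alpha = 1 - conf_level
    # z critical value: the 1 - alpha/2 percentile of the standard normal
    z = NormalDist().inv_cdf(1 - alpha / 2)
    half_width = z * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half_width, p_hat + half_width


# 75 smokers out of 500 sampled subjects, as in the example
lower, upper = wald_interval(75, 500)
print(f"{lower:.6f} < p < {upper:.6f}")
```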

Finding the Confidence Interval in STATISTICA

As previously mentioned, we can find this same confidence interval for a single proportion using the Power Analysis module in STATISTICA.

With any data file opened, select the Statistics tab. In the Advanced/Multivariate group, click Power Analysis. In the Power Analysis and Interval Estimation Startup Panel, select Interval Estimation as the analysis category, and then select One Proportion, Z, Chi-Square Test as the analysis type.

Statistica Power Analysis and Interval Estimation

Click OK.

In the Single Proportion: Interval Estimation dialog box, enter 0.15 for Observed Proportion p, 500 for Sample Size (N), and 0.95 for Conf. Level.

Statistica single proportion interval estimation

Click Compute to calculate the confidence interval.

Statistica data interval estimation

The Pi (Crude) results should match what was calculated earlier by hand as these are the estimates using the normal approximation to the binomial distribution (note that the hand calculations could be off a little due to rounding the z critical value to 1.96; STATISTICA will carry this out to more decimals for better accuracy).

The results in the Interval Estimation spreadsheet also include two other ways to calculate the confidence interval for a proportion – Pi (Exact) (the confidence intervals are the “exact, Clopper-Pearson” confidence intervals) and Pi (Approximate) (the confidence intervals employ a score method with a continuity correction). For more information on how these two methods are computed, see methods 4 and 5 from Robert Newcombe’s paper, Two-Sided Confidence Intervals for the Single Proportion: Comparison of Seven Methods (1998, Statistics in Medicine, 17, 857-872).

Conclusion

Sometimes a researcher wants to estimate the true proportion of a population of interest by finding the confidence interval for that proportion. In STATISTICA, the Power Analysis module provides the means to find this estimate.

StatSoft Offers Free Enterprise Analytics Software to Businesses in Struggling European Economies

OCTOBER 15, 2012, TULSA, OK USA: StatSoft, Inc., one of the world’s largest providers of analytics software, is taking its corporate motto to a whole new level by offering STATISTICA Enterprise™ solutions (including its cutting-edge Big Data Predictive Analytics Platform) at no charge to companies in countries most affected by the European economic downturn: Greece, Portugal, and Spain.

StatSoft’s motto, “Making the World More Productive™,” reflects its core belief that business analytics and big data processing are key to the productivity of every growing company. The sour economies in some countries, however, have made it impossible for struggling businesses to afford the very software solutions that could help them increase productivity, streamline operations, and achieve safety, quality, and environmental improvements. So, for a limited time, StatSoft is offering its powerful Enterprise and Predictive Analytics solutions for free to qualifying businesses.

“Given StatSoft’s recent growth in other international markets, we are very pleased to be in a position to help our corporate neighbors in Greece, Portugal, and Spain whose well-known capabilities are being undermined by regional economic conditions,” notes StatSoft’s CEO, Dr. Paul Lewicki.

“The highly educated work force in those economies is fully capable of taking advantage of the well-demonstrated, tangible productivity improvements and savings that our modern predictive analytics software can offer. Paradoxically, however, the shortage of available credit prevents them from acquiring the crucial technology that would vastly speed up their recovery. Our goal is to make sure these companies succeed, because we firmly believe that analytics can change the world for the better.”

This large scale initiative adds new meaning to the well-known business term ROI (Return on Investment). StatSoft will serve first those companies whose infrastructure development it deems would produce the largest economic return, in terms of social and employment benefits. Lewicki anticipates other companies will follow suit.

“We hope for a snowball effect, where our leadership will prompt other companies to do the same as us,” he explains, “thus helping reduce the current risk for the Euro and speeding up the European recovery.”

As reflected in case studies and success stories at http://www.statsoft.com/customers/successstories/, STATISTICA Enterprise installations of any scope can result in huge dividends that help business enterprises not only survive but thrive, even while regional economies may be slow to recover.

Those companies in Spain and Portugal interested in this free software opportunity must contact the StatSoft Iberica office. Companies in Greece are welcome to contact any StatSoft office in Europe (found at https://www.statsoft.com/contact-us/statsoft-locations-map/) or the US Headquarters.

ABOUT STATISTICA AND STATSOFT, INC.
StatSoft was founded in 1984 and is now one of the world’s largest providers of analytics software, with 30 offices around the globe and more than one million users of STATISTICA software. StatSoft’s solutions enjoy an extremely high level of user satisfaction across industries, as demonstrated in the unprecedented record of top ratings in practically all published reviews and large, independent surveys of analytics users worldwide. With its comprehensive suite of STATISTICA solutions for a wide variety of industries, StatSoft is a trusted partner of the world’s largest organizations and businesses (including most of the Fortune 500 companies), providing mission-critical applications that help them increase productivity, control risk, reduce waste, streamline operations, achieve regulatory compliance, and protect the environment.
http://www.statsoft.com/company/
http://www.statsoftiberica.com

For more information contact:
Paul Hiller
918-749-1119 x270
philler@statsoft.com