Monthly Archives: January 2013

Using Case States to Color or Mark by Category in a Graph

Written by: STATISTICA

In the 2011 Rexer Data Mining Survey, STATISTICA was the victor for the third year in a row as the primary data mining tool used by practitioners. Out of 20 commercial categories included in the survey, STATISTICA took first place in 16. Many of these categories include highly complex areas such as variety of available algorithms, quality and accuracy of model performance, and ability to modify algorithm options to fine-tune analyses.
One of the areas in which STATISTICA also took first place is an area that is sometimes overlooked but is equally important in data analysis: strong graphical visualization of models. Large amounts of data or very complex ideas can be conveyed using a single still image. STATISTICA’s graphical tool set makes it easy to generate images, and the ability to customize these images is limited practically only by the bounds of your imagination.
Box plots by default do not show raw data. The box plot, more formally known as the box-and-whisker plot, is an exploratory graphic. These plots are used to show the distribution of a data set at a glance. Any type of data you might use in a histogram can also easily be used in a box plot. When scatterplots, which do show raw data, are used, there is a Mark Selected Subsets feature available that distinguishes individual data points by group. However, Mark Selected Subsets is not included as an option for box plots. One way to represent individual data points by groups is to use case states to color or mark by category.
Included in STATISTICA is an example data set named Rats.sta, which includes data obtained from rats running a maze in an experiment. The rats were raised in two environments, a Free environment and a Restricted environment. Within each environmental group, there were three strains of rats: Bright, Mixed, and Dull. These rats were used in a maze test, and the number of errors each rat made in completing the test was recorded and included in this data set. To see the distribution of errors made by environmental group, create a box plot with Errors as the dependent variable and Environment as the grouping variable.
StatSoft Box Plot of ERRORS grouped by ENVIRONMENT
By looking at this box plot, it can be seen that restricted-environment rats made more errors than the rats raised in the free environment. However, you are not able to see the raw data in these graphs or the different strains of rats in each group. To see this data, you can use case states to color or mark by category.
To proceed with customizing the box plot, you first need to use case states to assign icons to each separate strain of rat within the data set. With the data set active, click on the Data tab. In the Cases group, click Cases, and from the Case States submenu, select Color or Mark By Category. In the Color or Mark By Category dialog box, click the Change Variable button. In the variable selection dialog box, select the variable Strain, and click OK. Then, click OK in the Color or Mark By Category dialog box. The graphic below shows what the data set will look like once case states are used to color or mark by category.
STATISTICA data set using case states to color or mark by category
This procedure has now assigned an individual color-coded icon to each category of rat included in the variable Strain. These icons can now be used to provide additional details in the box plot graph regarding the strain of rat. To specify the customized box plot, resume the analysis initially started for the box plot. In the initial box plot, the only specifications made were the variable selections of Errors as the dependent variable and Environment as the grouping variable.
In the 2D Box Plots dialog box, click the Advanced tab. To include raw data points in the graph, select the Display raw data check box. If you generate a graph with this set of specifications, you will see that the graph still is not as readable as it could be.
STATISTICA Box Plot of ERRORS grouped by ENVIRONMENT
The icons for the different groups of strain overlap one another, and they are not included in the legend. To keep the icons from overlapping, there is another option that needs to be specified: Jitter. On the Advanced tab of the 2D Box Plots dialog box, shown below, you will see the option for Jitter beneath the Display raw data check box. The options specified for Jitter are Random and 50% width.
STATISTICA 2d box plots
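To make the Jitter setting more concrete, here is a minimal Python sketch of random jitter. It is not STATISTICA code: the function name, the seed parameter, and the reading of "50% width" as a horizontal spread equal to half the spacing between categories are all assumptions made for illustration.

```python
import random

def jitter_positions(category_index, n_points, width_fraction=0.5,
                     spacing=1.0, seed=0):
    """Return x positions for raw-data markers around one category.

    Assumption: width_fraction=0.5 means the markers may spread across
    half of the spacing between neighboring categories.
    """
    rng = random.Random(seed)
    half_width = width_fraction * spacing / 2.0
    return [category_index + rng.uniform(-half_width, half_width)
            for _ in range(n_points)]

# Hypothetical layout: Free is plotted at x = 0, Restricted at x = 1.
free_x = jitter_positions(0, 12)
restricted_x = jitter_positions(1, 12, seed=1)
```

Each marker keeps its original y value (the number of errors); only the x position is nudged, which is why overlapping icons become distinguishable without changing the data.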
Click OK to produce the graph. In the following image, the legend has been modified slightly: the entries for extreme data have been removed, and the special characters representing the strains of rats have been added. This can be done by double-clicking on the legend and typing, or copying and pasting text, into the Titles/Text dialog box.
STATISTICA Box Plot of ERRORS grouped by ENVIRONMENT
You can now see not only the raw data in the box plot, but also whether each data point belongs to the Bright, Mixed, or Dull strain of rat. With this level of customization, you can see that both the Bright and Dull strains of rats had levels of error in both the high and low extremes of the distribution of error, and this can be seen in both the Free and Restricted environments. However, the Mixed strain had error levels in the lower distributions of the groups, with only one data point close to the median in the Restricted environment. Is there something about the Mixed strain of rat that makes it possible for them to make fewer errors than either the Dull or Bright strains? This level of customization illuminates this relationship and makes it easily distinguishable to whoever now reviews these graphs.
Graphs in STATISTICA can be customized to display complex relationships among variables that may not be easily distinguishable in other formats.

Big Data Predictive Analytics, StatSoft Strong Performer

Written by: Angela Waner

Forrester recently published a report on Big Data Predictive Analytics Solutions. They scored StatSoft as a strong performer.

Disclosure: I was not interviewed by Forrester for their report. These are just my personal thoughts based on my work experience and the various news articles that I have read about the report.

Forrester said “StatSoft has a comprehensive number of analysis algorithms and is very strong in manufacturing use cases.”

Algorithms are important to StatSoft, but our focus is on the practical application of algorithms and data visualizations.

We have plenty of math geeks to support this work. If I step out of my office and cross the hallway, I could throw a paper airplane and it would land on a predictive analytics consultant or a statistician.

I agree that we have very strong manufacturing use cases, predictive analytics and quality control. StatSoft’s STATISTICA Enterprise platform is common in food, drink and drug manufacturing companies. If you eat cereal, use sugar, drink Pepsi or Coca-Cola, take medication or eat a sausage from the United States, then the odds are good that STATISTICA was used to ensure that product’s quality.

We also have pet food manufacturers as customers. This makes me happy because I own several dogs. I want their food to be safe too.

We have manufacturing customers like Georgia-Pacific (resins), SolarWorld (solar power components), Instrumentation Laboratory (medical devices), Caterpillar (heavy equipment) and many others.

As a project manager, I work on STATISTICA software projects and customer projects. In 2012, I worked on projects for Insurance and Financial companies with the STATISTICA Decisioning Platform®. The other StatSoft project managers were also specialized by industry.

I also know that we have strong insurance and financial use cases. The problem: I am not allowed to share many of these use cases because of non-disclosure agreements (NDAs). I can share some highlights.

My largest project, in 2012, was completed in December. It involved a very large bank that needed to control their expenses and better manage their model life cycle. They decided to purchase STATISTICA Decisioning Platform and drop SAS.

An insurance project (large datasets, predictive analytics and BI) just finished installing STATISTICA Enterprise in the cloud last week. The customer performed the installation without any big challenges. Our installation process for enterprise analytics is straightforward.

My favorite project in 2012 was a collaboration between a European bank and StatSoft employees. The bank had STATISTICA and SAS licenses. They wanted to drop their more expensive SAS licenses that generate reports.

The bank customer asked for an easy method to create crosstabulation reports. And the result is a new module named STATISTICA Reporting Tables.

cross tabulation

Based on my work load, I would say that StatSoft is a strong performer in Insurance and Financial Services, too.

Image Credit: http://commons.wikimedia.org/wiki/File:Paperairplane.png

Evaluating Scorecard Models

You have put in a lot of hard work generating a scorecard model.  Just imagine, you have looked through possibly hundreds of predictor variables and selected those that were most important to your model.  You’ve discretized them, looking at weight of evidence to verify that they have been properly prepared for use in developing your model.  All of your discretization scripts have been used to prepare your predictors for use in building a logistic regression model which is in turn used to create your scorecard.  Now that you have your model, how do you determine if your model performs as you expect?
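As a reminder of what the weight-of-evidence check involves, here is a minimal Python sketch. It assumes the common convention WoE = ln(share of goods in the bin / share of bads in the bin); the function name, the bin labels, and the counts are hypothetical, and a real implementation would need to handle bins with zero goods or zero bads.

```python
import math

def weight_of_evidence(bins):
    """Map each bin label to its weight of evidence.

    bins: dict of label -> (n_good, n_bad). Assumes every bin has at
    least one good and one bad, so no logarithm of zero occurs.
    """
    total_good = sum(good for good, _ in bins.values())
    total_bad = sum(bad for _, bad in bins.values())
    return {
        label: math.log((good / total_good) / (bad / total_bad))
        for label, (good, bad) in bins.items()
    }

# Hypothetical discretized predictor with made-up counts.
age_bins = {"18-29": (200, 60), "30-44": (400, 30), "45+": (400, 10)}
woe = weight_of_evidence(age_bins)
```

Under this convention, a positive value marks a bin where goods are over-represented relative to bads, and a negative value marks the opposite.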

There is a host of statistics and graphs that you can use to help you determine if your model is performing at the level you expect.  There is the Kolmogorov-Smirnov statistic, which is a measure of how much the probability distribution of the “goods” differs from that of the “bads,” and varies from a low of 0 to a high of 1.0. The Gini score reflects the overall unevenness in the relative frequencies of values along the range of scores, or a measure of the predictability of a model, and also ranges from a low of 0 to a high of 1.0. Divergence is a measure of the overall minimum distance between the “goods” and “bads,” and ranges from a low of 1.0 to high positive values. The Hosmer-Lemeshow value is also a form of a minimum distance test incorporating Chi-Square values, and it is evaluated like an ordinary Chi-Square value. The Receiver Operating Characteristic (ROC) curve is created by plotting the true-positive rate (sensitivity) against the false-positive rate (1-specificity). The area underneath the ROC curve varies from a low of 0 to a high of 1.0, the entire area between the axes.  Finally, a lift chart helps you visualize the effectiveness of a predictive model; lift is calculated as the ratio between the results obtained with and without the model.
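Three of these measures are easy to compute by hand on a small sample. The following Python sketch is an illustration only, not how STATISTICA computes them: the scores are made up, the function names are hypothetical, and the Gini score is assumed to follow the common convention Gini = 2 × AUC − 1.

```python
def ks_statistic(good_scores, bad_scores):
    """Largest gap between the empirical CDFs of the goods and the bads."""
    thresholds = sorted(set(good_scores) | set(bad_scores))
    largest_gap = 0.0
    for t in thresholds:
        cdf_good = sum(s <= t for s in good_scores) / len(good_scores)
        cdf_bad = sum(s <= t for s in bad_scores) / len(bad_scores)
        largest_gap = max(largest_gap, abs(cdf_good - cdf_bad))
    return largest_gap

def roc_auc(good_scores, bad_scores):
    """Area under the ROC curve, computed as the chance that a randomly
    chosen good outscores a randomly chosen bad (ties count as half)."""
    wins = 0.0
    for g in good_scores:
        for b in bad_scores:
            if g > b:
                wins += 1.0
            elif g == b:
                wins += 0.5
    return wins / (len(good_scores) * len(bad_scores))

def gini_score(good_scores, bad_scores):
    """Assumed convention: Gini = 2 * AUC - 1."""
    return 2.0 * roc_auc(good_scores, bad_scores) - 1.0

# Made-up scorecard scores for five goods and five bads.
goods = [620, 680, 700, 710, 750]
bads = [540, 580, 600, 640, 690]
ks = ks_statistic(goods, bads)
auc = roc_auc(goods, bads)
gini = gini_score(goods, bads)
```

With these sample scores, the goods outscore the bads in 22 of the 25 possible pairs, so the area under the curve is 0.88.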

Needless to say, going through each of these would require more than one blog post, and really you need to compare and contrast the results of many of these statistics and graphs to see how well your model is performing.  To whet your appetite for comparing your models, I’m going to stick to one of these options, the lift chart.

A lift chart is shown above.  The X-axis is graduated in terms of deciles, or bins of 10% of the total cases modeled. The Y-axis is graduated in terms of lift index or a factor expressing how much better the model performs in each decile. The model line is plotted by determining the ratio between the results predicted by our model compared to the results using no model.

In the lift chart, you can see that the lift values in the lower deciles are higher than the expected value plotted at 1.0, indicating that the model has a relatively high predictive power.  What does this mean?  For now let us focus on the 10% decile.

If we contacted 10% of all our customers using no model at all to decide which customers to contact, we could expect to reach about 10% of the customers who would respond, because a random 10% is a mix of responders and non-responders.  However, if we used our model to select 10% of our customer base, we could expect to reach between 22% and 24% of the responders.  That is a lift of between 2.2 and 2.4, meaning our model performs 2.2 to 2.4 times better than no model at all.

Does it make sense to use the model to select more customers to contact?  If you contacted 80% of your customer base with no model, you would expect to reach 80% of the responders.  With the model being used to select those customers, you could expect to reach 96% of them, but that is only 1.2 times better than no model at all.
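The lift figures quoted above are simple ratios of the with-model result to the no-model result. The Python sketch below reproduces that arithmetic and adds a hypothetical helper, cumulative_lift, showing one way the same ratio could be computed from scored cases; the helper name and the ten sample cases are made up for illustration.

```python
def lift(with_model, without_model):
    """Ratio of the result obtained with the model to the result without it."""
    return with_model / without_model

# The figures from the text.
lift_at_10_low = lift(0.22, 0.10)   # about 2.2
lift_at_10_high = lift(0.24, 0.10)  # about 2.4
lift_at_80 = lift(0.96, 0.80)       # about 1.2

def cumulative_lift(scores, responded, fraction):
    """Lift from contacting the top `fraction` of cases ranked by score."""
    ranked = sorted(zip(scores, responded), key=lambda pair: pair[0],
                    reverse=True)
    n_contacted = max(1, round(fraction * len(ranked)))
    hits = sum(flag for _, flag in ranked[:n_contacted])
    share_of_responders = hits / sum(responded)
    share_contacted = n_contacted / len(ranked)
    return share_of_responders / share_contacted

# Made-up example: ten scored customers, four of whom responded (1).
sample_scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
sample_responded = [1, 1, 0, 1, 0, 0, 0, 1, 0, 0]
top_20_lift = cumulative_lift(sample_scores, sample_responded, 0.2)
full_lift = cumulative_lift(sample_scores, sample_responded, 1.0)
```

Contacting everyone always gives a lift of 1.0, which is why the model line in a lift chart converges on the baseline as more of the customer base is included.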

Is that lift of 1.2 worth the extra cost of contacting 70% more of your customer base?  That’s for you, the content expert, to decide.  With the tools made available to you through lift charts and a whole host of other statistics and graphs for evaluating your scorecard model, you can gain the insight to make more informed decisions for your company to maximize your profit and reduce your risk.

April 8, 2014: Support Ends for Windows XP, Internet Explorer 6 and Office 2003

Microsoft Gold Partnership

StatSoft is a Microsoft Gold Partner. We follow Microsoft’s support schedule. STATISTICA’s standard support for Windows XP, IE6 and Office 2003 ends on April 8, 2014.

Companies and other organizations have about 320 business days to upgrade to Windows 7 and Office 2010.

If your organization is upgrading to Windows 7 and you don’t have STATISTICA 9 or a later release, you should upgrade. STATISTICA 8 was tested/validated on Windows Vista. STATISTICA 9 was tested/validated on Windows Vista and Windows 7. We have started validating STATISTICA on Windows 8, but this release will not be available until 2013.

StatSoft has customers who are in the process of upgrading from Windows XP/Office 2003. In particular, I am thinking of global pharmaceutical, medical device, and bio-pharmaceutical customers. Because they are regulated by the FDA (and other governmental bodies), they cannot simply upgrade. The pharma companies have to ensure that use of the software is “validated.” These companies have been planning and have started rolling out the upgrade. They will finish the upgrade process well before April 8, 2014. Yeah!

I have heard some individuals at various companies complain about upgrading to Office 2010. They don’t want to lose their menus. They dislike or hate Ribbons. To these individuals, I suggest watching a 90-minute video that shows the story of the Ribbon.

Microsoft’s development team for Office 2007 had an epiphany. They realized the “user interface was failing our users.” They had “vastly overestimated how well our current user interface was working.” After adjusting to the change, Ribbons work better than menus for the average user.

Word Ribbon Bar

This video helped me realize the Ribbons were created to help with visual categorization. Human brains are all about naming and categorizing.

After watching this video, I stopped seeing “Ribbon haters” as resistant to change. Maybe they are frustrated with how the functionality is grouped. For these people, it helps to create a “reference sheet” that maps menus to Ribbons. And it helps to acknowledge that their issues are real, but the fact is, Microsoft only provides Ribbons now.

There are also technical reasons why individuals dislike Ribbons. They may have backward compatibility issues. Many organizations created custom menus and they have struggled with customizing Ribbons.

StatSoft does understand the challenge that some individuals have moving to Ribbons. This is why we have menus and Ribbon bars within STATISTICA. By default, STATISTICA will display the Ribbon bar, but it is easy to switch to menus.

Note: StatSoft recommends using Ribbons.

I Score, You Score, We All Score Data – By: Angela Waner #Scoring

I am changing my blog’s focus for 2013. I am going to write about the “micro stories” that happen with BI and/or predictive analytics projects. These stories are not confidential. They are common problems.

My first story is about me and relates to scoring.

I score. You score. We all score.

Prior to becoming an employee at StatSoft, I was a project manager for software development projects in the travel industry. These projects were implemented globally in contact centers for thousands of users. The users answered phone calls and made reservations for customers who wanted to rent a car or reserve a hotel room. The software was mission critical. And the “change control” process was very painful because the production environment was used 24/7. We had a three-tiered environment with Development, Test and Production servers. The development methodology varied based on the needs of the project: Waterfall, rapid application development, or Agile.

When I started working for StatSoft, I thought that my experience with managing “data analytic projects” was minimal.

But after working at StatSoft for a while, I learned that I had TONS of experience managing “data analytic” and “predictive analytic” projects. I did not know it because the vocabulary was different.

Contact centers have tons of data. The data is constantly turned into reports. The reports are turned into action. And they do predictive analytics every second of the day. But it is called “employee scheduling,” “best buy,” “forecasting demand,” or “yield management.”

I learned that scoring was not just for football or ping pong games. Scoring is for winners.

A data mining consultant (StatSoft employee) might say, “The contact center is live scoring to determine the best rate to offer to the customer who called the contact center. They are scoring the data from the one customer against a mathematical model (which can be coded into a database, a computer program, or a web server). The scoring process will evaluate all the variables and return a price.”
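In code, live scoring of a single caller can be as small as the Python sketch below. Everything in it is hypothetical: the variable names, the coefficients, and the choice of a simple linear model are assumptions made for illustration, not a description of any real contact center's pricing model.

```python
# Hypothetical coefficients standing in for a deployed pricing model.
PRICING_MODEL = {
    "intercept": 40.0,
    "rental_days": -1.5,
    "is_weekend": 6.0,
    "loyalty_member": -4.0,
}

def score_customer(customer, model=PRICING_MODEL):
    """Evaluate every model variable for one customer and return a price."""
    price = model["intercept"]
    for name, coefficient in model.items():
        if name != "intercept":
            # Variables missing from the customer record count as zero.
            price += coefficient * customer.get(name, 0)
    return round(price, 2)

# One incoming call, described by the variables the model expects.
caller = {"rental_days": 3, "is_weekend": 1, "loyalty_member": 1}
offered_price = score_customer(caller)
```

The model is built ahead of time from historical data; at call time, only this quick evaluation runs, which is what makes scoring fast enough to use while the customer is on the phone.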

Image credit: A few scores from the 2007 StatSoft Employee Ping Pong Tournament. Employees play a single elimination to 11 points every year. Nitin won in 2007.

Text Mining YouTube: Analyzing Comments with STATISTICA

STATISTICA PI Connector

The STATISTICA PI Connector is an add-on product in the STATISTICA suite of analytics software to directly connect to PI Systems within a company’s infrastructure. OSIsoft delivers the PI System, the industry standard in enterprise infrastructure, for management of time series data and events. A global base of more than 14,000 installations across manufacturing, energy, utilities, life sciences, data centers and process industries relies upon the OSIsoft PI System to safeguard data and deliver enterprise-wide visibility into operational and business data in order to manage assets, mitigate risks, improve processes, drive innovation, make business decisions in real time, as well as identify competitive business and market opportunities.

 

With the STATISTICA PI Connector, data from the PI system can be browsed and imported directly into one of STATISTICA’s workstation or server products.

STATISTICA PI Connector OSIsoft Plant Information PI data historian

STATISTICA leverages data in PI to optimize a company’s processes, using data mining and optimization technology that is particularly effective for process optimization.

 

In the Power Generation industry, for example, StatSoft has completed projects to optimize coal furnaces for stable flame temperatures and lower emissions.

STATISTICA provides the most comprehensive set of analytic tools in any single analysis software package that will directly connect to PI data repositories, to leverage all the data already being collected and managed. The STATISTICA software offers hundreds of different graphs which can be customized and automated, and statistical and data mining methods that span the entire range of useful techniques.

Typical applications include simple and advanced (multivariate) process monitoring, multivariate SPC, real-time batch SPC/QC, and drill-down (e.g., in industries such as pharmaceuticals, pulp and paper, refining, and power generation). All these capabilities are provided in a client-server and web-enabled platform or a simple desktop solution.