Monthly Archives: January 2013
Written by: Angela Waner
Forrester recently published a report on Big Data Predictive Analytics Solutions. They scored StatSoft as a strong performer.
Disclosure: I was not interviewed by Forrester for their report. These are just my personal thoughts based on my work experience and the various news articles that I have read about the report.
Forrester said “StatSoft has a comprehensive number of analysis algorithms and is very strong in manufacturing use cases.”
Algorithms are important to StatSoft, but our focus is on the practical application of algorithms and data visualizations.
We have plenty of math geeks to support this work. If I stepped out of my office, I could throw a paper airplane across the hallway and it would land on a predictive analytics consultant or a statistician.
I agree that we have very strong manufacturing use cases, predictive analytics and quality control. StatSoft’s STATISTICA Enterprise™ platform is common in food, drink and drug manufacturing companies. If you eat cereal, use sugar, drink Pepsi or Coca-Cola, take medication or eat a sausage from the United States, then the odds are good that STATISTICA was used to ensure that product’s quality.
We also have pet food manufacturers as customers. This makes me happy because I own several dogs. I want their food to be safe too.
We have manufacturing customers like Georgia-Pacific (resins), SolarWorld (solar power components), Instrumentation Laboratory (medical devices), Caterpillar (heavy equipment) and many others.
As a project manager, I work on STATISTICA software projects and customer projects. In 2012, I worked on projects for insurance and financial companies with the STATISTICA Decisioning Platform®. The other StatSoft project managers also specialized by industry.
I also know that we have strong insurance and financial use cases. The problem: I am not allowed to share many of these use cases because of non-disclosure agreements (NDAs). But I can share some highlights.
My largest project of 2012 was completed in December. It involved a very large bank that needed to control its expenses and better manage its model life cycle. The bank decided to purchase the STATISTICA Decisioning Platform and drop SAS.
An insurance project (large datasets, predictive analytics and BI) finished installing STATISTICA Enterprise in the cloud just last week. The customer performed the installation themselves, without any big challenges. Our installation process for enterprise analytics is straightforward.
My favorite project in 2012 was a collaboration between a European bank and StatSoft employees. The bank had STATISTICA and SAS licenses. They wanted to drop their more expensive SAS licenses that generate reports.
The bank customer asked for an easy method to create crosstabulation reports. And the result is a new module named STATISTICA Reporting Tables.
Based on my work load, I would say that StatSoft is a strong performer in Insurance and Financial Services, too.
Image Credit: http://commons.wikimedia.org/wiki/File:Paperairplane.png
You have put in a lot of hard work generating a scorecard model. Just imagine, you have looked through possibly hundreds of predictor variables and selected those that were most important to your model. You’ve discretized them, looking at weight of evidence to verify that they have been properly prepared for use in developing your model. All of your discretization scripts have been used to prepare your predictors for use in building a logistic regression model which is in turn used to create your scorecard. Now that you have your model, how do you determine if your model performs as you expect?
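As a minimal sketch of the discretization step mentioned above, here is how weight of evidence (WoE) can be computed for a binned predictor. The bin names and counts are invented for illustration; this is not StatSoft's implementation.

```python
# Illustrative sketch: weight of evidence (WoE) for a binned predictor.
# WoE = ln( (% of all "goods" in the bin) / (% of all "bads" in the bin) )
from math import log

def weight_of_evidence(goods_in_bin, bads_in_bin, total_goods, total_bads):
    pct_goods = goods_in_bin / total_goods
    pct_bads = bads_in_bin / total_bads
    return log(pct_goods / pct_bads)

# Made-up example: (goods, bads) counts per income bin
bins = {"low": (100, 80), "medium": (300, 60), "high": (200, 20)}
total_goods = sum(g for g, b in bins.values())  # 600
total_bads = sum(b for g, b in bins.values())   # 160

for name, (g, b) in bins.items():
    print(name, round(weight_of_evidence(g, b, total_goods, total_bads), 3))
```

A bin with WoE near zero separates goods from bads no better than chance, while strongly positive or negative bins carry predictive signal; this is what you inspect to verify the predictor was properly prepared.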
There is a host of statistics and graphs that you can use to help determine whether your model is performing at the level you expect:

- The Kolmogorov-Smirnov (KS) statistic is a measure of how much the probability distribution of the “goods” differs from that of the “bads,” and varies from a low of 0 to a high of 1.0.
- The Gini score reflects the overall unevenness in the relative frequencies of values along the range of scores, or a measure of the predictability of a model, and also ranges from a low of 0 to a high of 1.0.
- Divergence is a measure of the overall distance between the “goods” and “bads,” and ranges from a low of 0 to high positive values.
- The Hosmer-Lemeshow value is a goodness-of-fit test incorporating Chi-Square values, and it is evaluated like an ordinary Chi-Square value.
- The Receiver Operating Characteristic (ROC) curve is created by plotting the true-positive rate (sensitivity) against the false-positive rate (1-specificity). The area under the ROC curve varies from a low of 0 to a high of 1.0, the entire area between the axes.
- Finally, a lift chart helps you visualize the effectiveness of a predictive model, calculated as the ratio between the results obtained with and without the model.
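To make the first of these concrete, here is a small sketch (not StatSoft code) of how the KS statistic can be computed from a list of scores and good/bad labels: it is the maximum gap between the two cumulative distributions as you sweep a cutoff across the score range.

```python
# Illustrative sketch: Kolmogorov-Smirnov statistic for a scorecard.
def ks_statistic(scores, labels):
    """labels: 1 = good, 0 = bad. Returns max |CDF_good - CDF_bad| over cutoffs."""
    pairs = sorted(zip(scores, labels))       # sweep cutoffs in score order
    n_good = sum(labels)
    n_bad = len(labels) - n_good
    cum_good = cum_bad = 0
    best = 0.0
    for _, label in pairs:
        if label == 1:
            cum_good += 1
        else:
            cum_bad += 1
        best = max(best, abs(cum_good / n_good - cum_bad / n_bad))
    return best

# Perfect separation (all bads score below all goods) gives KS = 1.0
print(ks_statistic([1, 2, 3, 4], [0, 0, 1, 1]))
```

A KS near 1.0 means some cutoff cleanly splits goods from bads; a KS near 0 means the two score distributions overlap almost completely.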
Needless to say, going through each of these would require more than one blog post, and you really need to contrast and compare the results of many of these statistics and graphs to see how well your model is performing. To whet your appetite for comparing your models, I’m going to stick to one of these options: the lift chart.
A lift chart is shown above. The X-axis is graduated in terms of deciles, or bins of 10% of the total cases modeled. The Y-axis is graduated in terms of lift index or a factor expressing how much better the model performs in each decile. The model line is plotted by determining the ratio between the results predicted by our model compared to the results using no model.
In the lift chart, you can see that the lift values in the lower deciles are higher than the expected value plotted at 1.0, indicating that the model has a relatively high predictive power. What does this mean? For now let us focus on the 10% decile.
If we contacted 10% of our customers chosen with no model at all, we could expect to reach about 10% of the responders, since a random slice captures a proportional share of them. However, if we used our model to select the top-scoring 10% of our customer base, we could expect to reach between 22% and 24% of the responders. That is a lift of between 2.2 and 2.4, meaning our model performs 2.2 to 2.4 times better than no model at all.
Does it make sense to use the model to select more customers to contact? If you contacted a randomly chosen 80% of your customer base, you would expect to reach 80% of the responders. With the model being used to select those customers, you could expect to reach 96% of the responders, but that is only 1.2 times better than no model at all.
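The decile calculation described above can be sketched in a few lines: sort customers by model score, take the top 10%, 20%, …, and compare the share of responders captured in each slice to the share a random slice would capture. The data below is made up for illustration.

```python
# Minimal sketch of cumulative decile lift (illustrative, not StatSoft code).
def decile_lift(scores, responded):
    """Cumulative lift per decile: share of responders captured in the
    top d*10% of model-ranked customers, divided by d*10%."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    total_resp = sum(responded)
    n = len(scores)
    lifts = []
    for d in range(1, 11):
        top = order[: round(n * d / 10)]            # top d deciles by score
        captured = sum(responded[i] for i in top) / total_resp
        lifts.append(captured / (d / 10))
    return lifts

# Toy data: 10 customers; the model scores the 2 real responders highest.
scores = list(range(10, 0, -1))
responded = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
print(decile_lift(scores, responded))
```

By construction, the lift at the 100% decile is always 1.0 (contacting everyone captures everyone), which is why the model line in a lift chart converges to the no-model baseline.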
Is that lift of 1.2 worth the extra cost of contacting 70% more of your customer base? That is for you, the content expert, to decide. With the tools made available to you through lift charts, and a whole host of other statistics and graphs for evaluating your scorecard model, you have the insight to make more informed decisions for your company, maximizing profit and reducing risk.
Companies and other organizations have about 320 business days, before Windows XP support ends on April 8, 2014, to upgrade to Windows 7 and Office 2010.
If your organization is upgrading to Windows 7 and you don’t have STATISTICA 9 or a later release, you should upgrade. STATISTICA 8 was tested and validated on Windows Vista; STATISTICA 9 was tested and validated on both Windows Vista and Windows 7. We have started validating STATISTICA on Windows 8, but that release will not be available until later in 2013.
StatSoft has customers who are in the process of upgrading from Windows XP/Office 2003. In particular, I am thinking of global pharmaceutical, medical device, and bio-pharmaceutical customers. Because they are regulated by the FDA (and other governmental bodies), they cannot simply upgrade. The pharma companies have to ensure that use of the software is “validated.” These companies have been planning and have started rolling out the upgrade. They will finish the upgrade process well before April 8, 2014. Yeah!
I have heard some individuals at various companies complain about upgrading to Office 2010. They don’t want to lose their menus; they dislike, or even hate, Ribbons. To these individuals, I suggest watching a 90-minute video that tells the story of the Ribbon.
Microsoft’s development team for Office 2007 had an epiphany. They realized the “user interface was failing our users.” They had “vastly overestimated how well our current user interface was working.” After adjusting to the change, Ribbons work better than menus for the average user.
This video helped me realize the Ribbons were created to help with visual categorization. Human brains are all about naming and categorizing.
After watching this video, I stopped seeing “Ribbon haters” as resistant to change. Maybe they are frustrated with how the functionality is grouped. For these people, it helps to create a “reference sheet” that maps menus to Ribbons. It also helps to acknowledge that their issues are real; but the fact is, Microsoft only provides Ribbons now.
There are also technical reasons why individuals dislike Ribbons, such as backward compatibility issues. Many organizations created custom menus, and they have struggled to recreate them as custom Ribbons.
StatSoft does understand the challenge that some individuals have moving to Ribbons. This is why we have menus and Ribbon bars within STATISTICA. By default, STATISTICA will display the Ribbon bar, but it is easy to switch to menus.
Note: StatSoft recommends using Ribbons.
I am changing my blog’s focus for 2013. I am going to write about the “micro stories” that happen with BI and/or predictive analytics projects. These stories are not confidential. They are common problems.
My first story is about me and relates to scoring.
I score. You score. We all score.
Prior to becoming an employee at StatSoft, I was a project manager for software development projects in the travel industry. These projects were implemented globally in contact centers for thousands of users. The users answered phone calls and made reservations for customers who wanted to rent a car or reserve a hotel room. The software was mission critical, and the “change control” process was very painful because the production environment was used 24/7. We had a three-tiered environment with Development, Test and Production servers. The development methodology varied based on the needs of the project: Waterfall, rapid application development, or Agile.
When I started working for StatSoft, I thought that my experience with managing “data analytic projects” was minimal.
But after working at StatSoft for a while, I learned that I had TONS of experience managing “data analytic” and “predictive analytic” projects. I just did not know it, because of vocabulary.
Contact centers have tons of data. The data is constantly turned into reports, and the reports are turned into action. Contact centers do predictive analytics every second of the day, but they call it “employee scheduling,” “best buy,” “forecasting demand,” or “yield management.”
I learned that scoring was not just for football or ping pong games. Scoring is for winners.
A data miner consultant (StatSoft employee) might say, “The contact center is live scoring to determine the best rate to offer the customer who called the contact center. They are scoring the data from that one customer against a mathematical model (which can be coded into a database, a computer program, or a web server). The scoring process will evaluate all the variables and return a price.”
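To picture what that scoring process might look like, here is a toy sketch of live scoring: a deployed model evaluates one caller’s variables and returns a rate to offer. The variable names, coefficients, and pricing rule are all invented for the example; a real contact center model would be far more involved.

```python
# Hypothetical illustration of live scoring: a toy logistic scorecard
# evaluates one caller's variables and returns a daily rate to offer.
# All names, coefficients, and prices below are invented.
from math import exp

COEFFICIENTS = {"loyalty_years": 0.30, "rentals_last_year": 0.10, "is_weekend": -0.50}
INTERCEPT = -1.0
BASE_RATE, MAX_DISCOUNT = 89.00, 20.00  # dollars per day

def score_customer(variables):
    """Estimate the probability the caller books, then price the offer:
    the less likely the booking, the sweeter the rate."""
    z = INTERCEPT + sum(COEFFICIENTS[k] * v for k, v in variables.items())
    p_book = 1 / (1 + exp(-z))                     # logistic score
    return BASE_RATE - MAX_DISCOUNT * (1 - p_book)  # discount unlikely bookers

caller = {"loyalty_years": 2, "rentals_last_year": 5, "is_weekend": 1}
print(round(score_customer(caller), 2))
```

The point is the shape of the workflow, not the numbers: one customer’s data goes in, the model evaluates every variable, and a single actionable score (here, a price) comes back in real time.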
Image credit: A few scores from the 2007 StatSoft Employee Ping Pong Tournament. Employees play a single elimination to 11 points every year. Nitin won in 2007.
The STATISTICA PI Connector is an add-on product in the STATISTICA suite of analytics software to directly connect to PI Systems within a company’s infrastructure. OSIsoft delivers the PI System, the industry standard in enterprise infrastructure, for management of time series data and events. A global base of more than 14,000 installations across manufacturing, energy, utilities, life sciences, data centers and process industries relies upon the OSIsoft PI System to safeguard data and deliver enterprise-wide visibility into operational and business data in order to manage assets, mitigate risks, improve processes, drive innovation, make business decisions in real time, as well as identify competitive business and market opportunities.
With the STATISTICA PI Connector, data from the PI System can be browsed and imported directly into one of STATISTICA’s workstation or server products.
STATISTICA leverages data in PI to optimize a company’s processes, using data mining and optimization technology that is particularly effective for process optimization.
In the Power Generation industry, for example, StatSoft has completed projects to optimize coal furnaces for stable flame temperatures and lower emissions.
STATISTICA provides the most comprehensive set of analytic tools in any single analysis software package that will directly connect to PI data repositories, to leverage all the data already being collected and managed. The STATISTICA software offers hundreds of different graphs which can be customized and automated, and statistical and data mining methods that span the entire range of useful techniques.
Typical applications include simple and advanced (multivariate) process monitoring, multivariate SPC, and real-time batch SPC/QC with drill-down (e.g., in the pharmaceutical, pulp and paper, refining, and power generation industries). All of these capabilities are provided in a client-server, web-enabled platform or a simple desktop solution.