Monthly Archives: March 2014

The Dell Acquisition: Hot Topic on the Web!

If the sheer volume of this week’s Twitter buzz is any indication, Dell Software’s acquisition of Tulsa-based StatSoft (announced three days ago) has surprised, impressed, and befuddled many an observer of the advanced analytics space.

Within hours of Dell’s press release this past Monday morning, plenty of forward-thinking statements and opinions were already being expressed as bloggers and journalists trumpeted the information across social media channels.

For your reading pleasure, here is a short list of just some of the feedback I’ve been able to keep up with. To help you decide what to read, I have taken the liberty of noting what I found to be quick takeaways.

No doubt, we will see more opinions and thoughts published in the coming weeks and months. Naturally, we are pretty excited about the possibilities here at the (former) StatSoft HQ. What are your thoughts on all this?

StatSoft is now part of Dell

StatSoft is proud to announce today that we have joined forces with Dell and Dell’s Information Management Group, one of the largest providers of end-to-end BI and analytic solutions in the market. As of today, StatSoft is part of the Dell organization.

End-to-end advanced analytics solutions. For StatSoft and Dell customers, this means new opportunities and capabilities: leading-edge analytics technologies that leverage the accelerating growth of data in every industry to achieve and retain industry leadership. Turning the torrents of data into actionable information is the fundamental mission of StatSoft as well as Dell’s Information Management Group. StatSoft’s big data predictive modeling and data mining solutions for various industries, combined with Dell’s wide range of data management and software capabilities and its affordable, leading-edge, and comprehensively supported x86 server platforms, can deliver big data analytics at a Dell price-point for unbeatable ROI.

Dell Software already offers a host of tools to manage data and databases across structured and unstructured data sources, including products such as Toad for Oracle, Toad for SQL Server, and Spotlight on SQL Server Enterprise. It also offers tools to integrate data and applications distributed across the organization, including products such as SharePlex and Dell Boomi, the latter of which was recently positioned by Gartner, Inc. in the “Leaders” Quadrant of the Magic Quadrant for Enterprise Integration Platform as a Service.

Making the World More Productive

We are excited to combine with Dell’s shared resources, which provide myriad opportunities to leverage StatSoft’s analytic solutions in concert with Dell’s hardware solutions and its numerous industry relationships, including those with SAP HANA, Oracle, Microsoft SQL and PDW, and Cloudera. We look forward to continued growth together with our distinguished list of successful customers in practically every industry, and we thank you for your support.

StatSoft Recognized in Magic Quadrant™ Announcement at BI Summit

Information technology research and advisory firm Gartner has unveiled its Magic Quadrant for Advanced Analytics Platforms, publicly highlighting StatSoft’s position with top-tier “ability to execute” advanced solutions.

This particular quadrant report, new to Gartner’s offerings, was released February 24 in a brief session at the Gartner Business Intelligence and Information Management Summit in Sydney, Australia.

Darrel Amarasekera, Managing Director of StatSoft Pacific, was among the audience of vendors and executives, whom he described as “enthusiastic [and] very, very attentive” while Gartner Research Director Lisa Kart skimmed through the report’s contents.

Kart specifically drew the audience’s attention to StatSoft’s status as a new entrant among the top three vendors capable of executing advanced solutions. She shared with the audience some of the strengths of the STATISTICA platform. In the downloadable report (sign-in required), these strengths address STATISTICA’s wide range of functionality with a broad variety of data types; high customer satisfaction with advanced descriptive and predictive analytics; and scalability. In addition, StatSoft was reported with some of the highest evaluations for product reliability and upgrade experience, and STATISTICA was most frequently selected for license cost and speed of model deployment.

Previously, Gartner analysts had combined business intelligence with analytics in their annual Magic Quadrant research reports. However, recent industry changes with big data and predictive analytics have prompted them to develop a standalone “Advanced Analytics” category this year.

Gartner clients can access the complete Advanced Analytics Magic Quadrant report online.

STATISTICA reduces emissions spikes & associated costs with #PredictiveAnalytics at coal coking plant, pays for itself in 6 months

New Success Story: STATISTICA reduces emissions spikes & associated costs with #PredictiveAnalytics at coal coking plant, pays for itself in 6 months. What about your industry? Click here to read full article.

Mayato Study: STATISTICA Surpasses Top Competitors in User Friendliness, Modern Interface

This past spring, Mayato, a data mining and business analytics consulting company based in Germany, conducted its annual study of data mining tools.

The 2013 study focused on multi-media analytics solutions and pitted several major software vendors against one another. Once again, STATISTICA scored very highly and earned top ranking for user friendliness.

Of over 150 analytics tools on the market, Mayato included STATISTICA among its selection of four data mining suites whose functionality it considers to be comprehensive:

  • StatSoft: STATISTICA Professional 12
  • IBM SPSS Statistics Professional 21
  • SAS Enterprise Guide 5.1
  • Rapid-I: RapidMiner 5.3 / R (open-source)

Each tool had to prove itself in a test scenario covering all phases of a typical analysis project: from data import through the creation of forecasting models (linear regression) to the interpretation of results. Factors affecting the user experience—stability, speed, documentation, and operation—were also evaluated.

Analyst Peter Neckel at Computerwoche magazine reviewed the study and the competing tools in a German-language article published April 25, 2013.

Neckel noted that STATISTICA outstripped the competitive field in the area of user friendliness, thanks to its modern and consistent user interface for all tasks and products. He also expressed appreciation for STATISTICA’s abundant variety of functions, especially regarding the number of available regression, data preparation, and parameterization methods.

Mayato conducted its field test on a sample of real data sets from JustBook, a hotel booking apps provider seeking to distribute its marketing budget efficiently across online and offline channels.

Complete study results are available at http://www.mayato.com.

How to Make Model Deployment Easier than Ever with New Workspace Nodes

Our previous How-To article, How to Deploy Models Using SVB Nodes, covered a topic that is becoming increasingly important, especially in data mining applications whose graphical user interface uses nodes to represent data mining algorithms. In it, Rajiv Bhattarai covered deployment using the original STATISTICA Visual Basic (SVB) nodes. As STATISTICA keeps pace with rapid advances in technology and StatSoft makes significant investments to remain a leader in predictive analytics, new nodes have been developed. These have prompted many questions, so this article describes the differences between the scripted SVB nodes and the new STATISTICA Workspace nodes, and shows how the new nodes make model deployment easier than ever.

As with the previous article, this article assumes that you have a basic understanding of how to navigate through the workspace. If you need a refresher, see How to Navigate the STATISTICA Workspace.
New STATISTICA Workspace Nodes v. Scripted Nodes
As you work with STATISTICA Workspaces, you will see two types of nodes. The first type is the scripted SVB nodes described in the previous article, which are not the focus of this article. These are indicated by SVB on the node icons, as you will see below. The new nodes were introduced as enhancements to the workspace user interface and closely resemble the interactive user interface of the respective modules. Below you will see a comparison of the Boosted Trees Classification SVB node and the new Boosted Classification Trees node.
Boosted Trees Classification SVB node, STATISTICA screenshot
New Boosted Classification Trees node, STATISTICA 12 screenshot
Describing all the additional features of the new nodes in detail is beyond the scope of this article, but here are some highlights that will help you distinguish the new nodes from the SVB nodes. A few of the properties of the new nodes are:
  • Before the node is run, it will appear with a yellow background. When the node is run, the background will turn from yellow to clear, an indication that you have completed the analysis.
  • Additional functionality is represented by icons on the node:
    • Nodes are run by clicking the green arrow icon located at the lower-left corner of the analysis node.
    • Parameters can be edited by clicking the gray gear icon at the upper-left corner of the node.
    • Node results can be viewed by clicking the report icon at the upper-right corner of the node.
    • Downstream results are indicated by a document icon at the lower-right corner of the node.
    • Nodes can be connected by clicking the gold diamond icon at the center-right side of the node, holding down, and drawing an arrow to another node where you can release the click, thereby attaching two nodes together.
  • Variable selection can be performed on the analysis node.
  • The functionality of the node closely resembles the functionality of the respective interactive analysis. As you can see with the results options for the Boosted Classification Trees above, in the results alone, you have much more control over what output is provided upon completion of the analysis.
  • Deployment functionality is built into the node.
Deployment Example with New Nodes
This example uses historical data on customers with either Good or Bad credit, that is, customers who satisfied or defaulted on their loans, respectively. We will build two models, one using Logistic Regression and one using Boosted Trees, to predict Good or Bad credit for future applicants, and then compare their performance.
Open the data set provided with STATISTICA titled creditscoring.sta.
On the Home tab in the Output group, click Add to Workspace and select Add to New Workspace. In the title bar of the workspace, verify that Beta Procedures is selected.
Selecting Beta Procedures tab in New Workspace, STATISTICA
As new nodes are created for algorithms, and as they are fully tested, they are made available in the All Validated Procedures selection. Boosted Trees Classification is currently available using this option. Logistic Regression is currently in the testing process and is therefore only available within the Beta Procedures area.
Within the data set, there is a variable titled TrainTest that separates the data into a training data set and a testing data set. To split the data into these two groups, do the following:
On the Data tab in the Manage group, click Subset twice to add two subset nodes into the workspace.  Verify that the subset nodes are connected to the data node. One helpful practice for modifying the workspace in order to clearly keep track of your analyses is to rename nodes according to your selection criteria. Edit the names of the nodes (right-click on the name and select Rename) to represent the training and testing data as illustrated below.
Editing names of nodes in STATISTICA Workspace
To edit the parameters of a node, you can either click the gray gear icon at the upper-left corner of the node or double-click the node. In the Include Cases group box, select the Specific, selected by option button. Enter the expression as shown in the next illustration.
Editing Parameters of new workspace nodes, STATISTICA
Complete the same procedure for the subset node that represents the testing data.
In the workspace illustration above, you can see that the Training subset node has been run since it no longer has a yellow background (run your Training node by clicking the green arrow icon at the lower-left corner of the node). Also, the document icon at the lower-right corner means that there is data available for downstream analysis. Clicking on that document icon will open the available data, and when you scroll to the right of the data file, you can verify that only those cases with TrainTest = “Train” have been selected, indicating you have specified the correct inclusion criteria in the subset node.
Training subset node output, STATISTICA
Close the data set.
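For readers who think in code, the effect of the two subset nodes can be sketched in a few lines of Python. This is only an illustration of the inclusion criteria, not how STATISTICA implements the Subset node. The rows are made up, the `Credit` column name and the value `"Test"` for the testing cases are assumptions, and only the TrainTest variable and its `"Train"` value come from the steps above.

```python
def subset(rows, column, value):
    """Return only the rows whose `column` equals `value`."""
    return [row for row in rows if row[column] == value]


# Hypothetical stand-in for creditscoring.sta (invented values).
rows = [
    {"Credit": "good", "TrainTest": "Train"},
    {"Credit": "bad", "TrainTest": "Train"},
    {"Credit": "good", "TrainTest": "Test"},
    {"Credit": "bad", "TrainTest": "Train"},
    {"Credit": "good", "TrainTest": "Test"},
]

# One subset per node: the Training node and the Testing node.
training = subset(rows, "TrainTest", "Train")
testing = subset(rows, "TrainTest", "Test")

# Same check as scrolling through the node output: every retained case
# must carry the expected TrainTest value.
assert all(row["TrainTest"] == "Train" for row in training)
assert all(row["TrainTest"] == "Test" for row in testing)

print(len(training), "training cases;", len(testing), "testing cases")
```

Keeping the two subsets as separate objects mirrors the workspace layout, where each renamed subset node feeds its own downstream branch.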
On the Data Mining tab in the Trees/Partitioning group, click Boosted Trees and select Boosted Classification Trees. On the Statistics tab, in the Advanced/Multivariate group, click Advanced Models > Generalized Linear/Nonlinear and select GLZ Custom Design (beta). Ensure that both nodes are connected to the Training node.
Connecting new nodes in STATISTICA Workspace
Edit the parameters of the Boosted Classification Trees analysis node and make the variable selections shown below.
Editing parameters, variable selections of analysis node, STATISTICA
In the Boosted Classification Trees dialog box, select the Code Generator tab. Verify that the only selection is for PMML.
Code Generator dialog options, STATISTICA
Leave all other settings at their default values, and click the OK button.
Edit the parameters of the GLZ Custom Design (beta) node. On the Quick tab, select Logit model with a Binomial distribution using the Logit Link function.
Quick tab selections for generalized linear models, STATISTICA
On the Model Specification tab, make the same variable selections as in the Boosted Classification Trees analysis node. On the Code Generator tab, verify that only PMML is selected, and click OK.
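As background on what the Logit link selection means, here is a minimal Python sketch of the logit link and its inverse. It shows only the general form of a logistic regression prediction, not STATISTICA's fitting procedure. The intercept, coefficients, and predictor values are hypothetical numbers chosen for illustration.

```python
import math


def logit(p):
    """Logit link: map a probability in (0, 1) to the log-odds scale."""
    return math.log(p / (1.0 - p))


def inverse_logit(eta):
    """Inverse link: map a linear predictor back to a probability."""
    return 1.0 / (1.0 + math.exp(-eta))


def predict_probability(intercept, coefficients, values):
    """Probability implied by a fitted logit model for one case."""
    eta = intercept + sum(b * x for b, x in zip(coefficients, values))
    return inverse_logit(eta)


# Hypothetical fitted model with two predictors (made-up numbers).
p_bad = predict_probability(-1.0, [0.8, -0.5], [2.0, 1.0])
print(round(p_bad, 3))
```

The binomial distribution describes the two-category response (Good or Bad), while the link function ties the linear combination of predictors to the probability of one of those categories.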
Run both analysis nodes. There will be a warning displayed when logistic regression is being completed.  Ignore it for the purposes of this example, but for more information about zero pivot element messages in the Generalized Linear Model, see: http://documentation.statsoft.com/STATISTICAHelp.aspx?path=Gxx/Glz/Overviews/ZeroPivotElementDetectedDuringModelFitting. After the analysis computations complete, the workspace will appear as below.
STATISTICA Workspace after running both analysis nodes
To review the results of the analysis on the training data, you could double-click on the Reporting Documents icon. For this example, the focus will be on the performance of these models on the testing data. Two things need to be highlighted at this point in the example. The PMML that was generated in our analysis was automatically loaded into the PMML Model nodes, which are downstream of the analysis nodes. Edit the parameters of the PMML Model node that is connected to the Boosted Classification Trees analysis node and select the PMML tab.
PMML script included in node, STATISTICA
You can see that the PMML script that represents this Boosted Classification Trees model is included in this node. Close the Deployment using PMML dialog box.
Connect the Testing subset node to the Rapid Deployment node. The Rapid Deployment node takes the models to which it is connected and applies those models to data to which it is also connected. In this example, it will take the Boosted Classification Trees and Logistic Regression models and apply them to the Testing data.
Run the Testing subset node and verify that you have correctly selected only the Testing data.
Edit the parameters of the Rapid Deployment node. You can review the output options for this node outside of this example, but you will find that a wide range of output is available, from predicted probabilities to ROC curves.
For this example, we will leave all settings at their default values with the exception of the Lift chart settings. On the Lift chart tab, verify that the Lift chart (lift value) check box is selected, with bad as the Category of response.
Rapid Deployment Lift Chart settings, STATISTICA
Run the Rapid Deployment node, which deploys the Boosted Trees and Logistic Regression models onto the Test data. After the node is run, the workspace will appear as below.
Workspace after running Rapid Deployment node on Test data, STATISTICA
To review the results of the Rapid Deployment node, you can either double-click the Reporting Documents nodes, or you can click the document icon at the upper-right corner of the Rapid Deployment node. For this example, review the results by clicking on the appropriate icon on the Rapid Deployment node; this will bring you immediately to the Rapid Deployment results. Select the table of results for Summary of Deployment (Error rates) (Testing).
Table of Results from Rapid Deployment node, STATISTICA
From this table, we can see that the Boosted Trees model had an error rate of 30.5% and the Logistic Regression model had an error rate of 26.3%.  This indicates that at the default settings for the algorithms, the Logistic Regression model performs better than the Boosted Trees model.  In the results folder, select the lift chart.
Lift Chart, STATISTICA
From this chart, we can see that if we applied both models to all the testing data and took the 20 percent of cases with the highest predicted probability of the classification Bad, the Logistic Regression model would have a lift value of approximately 1.9, while the Boosted Trees model would have a lift value of approximately 1.7. This again confirms that, using the default settings, the Boosted Trees model is outperformed by the Logistic Regression model.
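The two comparisons above, error rate and lift, can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the Rapid Deployment node's actual computation. It assumes that lift is the rate of Bad cases among the top-scored fraction divided by the overall rate of Bad cases, and that a case is classified as Bad when its score is at least 0.5. The labels and scores are invented and do not reproduce the figures in the article.

```python
def error_rate(actual, predicted):
    """Share of cases whose predicted class differs from the actual class."""
    wrong = sum(1 for a, p in zip(actual, predicted) if a != p)
    return wrong / len(actual)


def lift_at(actual, scores, target="bad", fraction=0.2):
    """Target rate in the top-scored fraction divided by the overall target rate."""
    n_top = max(1, round(len(actual) * fraction))
    ranked = sorted(zip(scores, actual), key=lambda pair: pair[0], reverse=True)
    top_labels = [label for _, label in ranked[:n_top]]
    top_rate = top_labels.count(target) / n_top
    base_rate = list(actual).count(target) / len(actual)
    return top_rate / base_rate


def classify(scores, threshold=0.5):
    """Assumed decision rule: score >= threshold means 'bad'."""
    return ["bad" if s >= threshold else "good" for s in scores]


# Invented testing data: actual classes and each model's predicted
# probability of the class 'bad'.
actual = ["bad", "good", "bad", "good", "good", "bad", "good", "good", "bad", "good"]
scores_a = [0.9, 0.2, 0.8, 0.3, 0.1, 0.45, 0.4, 0.2, 0.7, 0.3]
scores_b = [0.9, 0.85, 0.4, 0.3, 0.1, 0.6, 0.4, 0.2, 0.7, 0.3]

for name, scores in (("Model A", scores_a), ("Model B", scores_b)):
    print(name, error_rate(actual, classify(scores)), lift_at(actual, scores))
```

A lift value above 1 means the model concentrates Bad cases among its highest-scored applicants better than random selection would, which is why the model with the higher lift at a given percentage is preferred.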

Popular Decision Tree: CHAID Analysis, Automatic Interaction Detection

The primary goal of churn analysis is to identify those customers who are most likely to discontinue using your service or product. In the dynamic financial industry, companies increasingly provide products and services with similar features. Amid this ever-growing competition, the cost of acquiring a new customer typically exceeds the cost of retaining a current customer. Existing customers are a valuable asset. Furthermore, given the nature of the financial services industry, where customers generally tend to stay with a company for a longer term, churning could lead to substantial revenue loss.

With StatSoft’s Churn Analysis Solution, you can identify customers who are likely to churn by making precise predictions, reveal customer segments and reasons for leaving, engage with customers to improve communication and loyalty, calculate attrition rates, and develop effective marketing campaigns to target customers and increase profitability. With STATISTICA’s advanced modeling algorithms and wide array of state-of-the-art tools, you can develop powerful models that aid in accurate prediction of customer behavior and trends and help you avoid losing customers.

STATISTICA Solution

  • Batch or Real-Time Processing: Use the models you have built to determine churn and indicate, either by batch or in real-time, the customers who are likely to transfer their business to another company.
  • Cutting-edge Predictive Analytics: STATISTICA provides a wide variety of basic to sophisticated algorithms to build models which provide the most lift and highest accuracy for improved churn analysis.
  • Innovative Data Pre-processing Tools: STATISTICA provides a very comprehensive list of data management and data visualization tools.
  • Integrated Workflow: STATISTICA Decisioning Platform provides a streamlined workflow for powerful, rules-based, predictive analytics where business rules and industry regulations are used in conjunction with advanced analytics to build the best models.
  • Optimized Results: Compare the latest data mining algorithms side-by-side to determine which models provide the most gain. Produce profit charts with ease.
  • Role-Based, Enterprise-Wide Scope: If yours is a multi-user collaborative environment, you can use STATISTICA Enterprise to share data, improve churn models, and benefit from collaborative work with small or large groups.
  • Text Mining Unstructured Data: Improve churn models by using powerful text mining algorithms to incorporate unstructured data currently sitting unused in storage.

Data Death Spiral: Too much categorization stymies decision-making

Perhaps some readers are aware of Sheena Iyengar’s (classic) jam choice study from 1995, in which a grocery market try-before-you-buy display was set up with 24 sample jars of jam, alternated every few hours with a much smaller display of 6 jars. As described in the NY Times, considerably more customers were drawn to the larger display; however, the share of those customers who went on to buy was only 1/10 of the share who bought from the limited 6-jar display. Professor Iyengar hypothesized that “the presence of choice might be appealing as a theory, but in reality, people might find more and more choice to actually be debilitating.”

Certainly, given that the availability of choices does have some value, data categorization is important. But when I ran across Seth Redmore’s recent post about his musical background and the size and scope of musical genres on the market today, I could not believe what he had discovered: a laughably over-zealous list of electronic music categories. Thousands of them.

I am by no means a music industry expert, but it seems clear that when a musician/composer arbitrarily invents a unique name for his personal “brand” of music, such action does not mean a new genre has officially come into being. After all, we are talking about classification of “unstructured” content here (i.e., music), not a scientific taxonomy. As a practical matter in the real world where decisions are made, the differentiation of these so-called genres and sub-genres exists only in the minds of the (likely self-absorbed) composers who coined their names.

From a data collection standpoint, the more categories assigned, the greater the chance of miscategorization, misinterpretation, and confusion. This would only hinder the “shared understanding” Mr. Redmore says can be achieved with data categorization, even if music providers claim such categorization is intended to help consumers find exactly what they want.

My counter-intuitive point here (and maybe Redmore’s, too) is that the consumer cannot possibly know what he wants when faced with so many non-standardized music choices with ridiculously similar genre names like ritual ambient v. black ambient v. doom ambient v. drone ambient v. deep ambient v. death ambient. Mr. Redmore even mentions Netflix with its nearly 77K movie categories! From a marketing standpoint, that is crazy. There is simply no practical reason to attempt the creation of big data where such breadth is detrimental to decision-making. And this would be true whether in the online music room or in the executive board room.