Blog Archives

Completing the value chain: data, insight, action


Thomas Hill, Ph.D. Dell Contributor at Tech Page One

Thomas Hill is Executive Director for Analytics at Dell’s Information Management Group

The value of effective predictive/prescriptive analytics is easily explained: The best and largest storage capabilities, fastest data access and ETL functionality, and most robust hardware infrastructure will not guarantee success in a highly competitive marketplace. If, however, one can predict what will happen next – how consumer sentiments will shift, which large insurance claim provides opportunities for subrogation, or how specific changes in the manufacturing process will drastically reduce warranty claims in the field – critical actions can be taken, yielding competitive advantages that could repay the entire investment required to achieve those insights within weeks or even days.

I sometimes like to point out that I have predicted every stock market crash in the past 30 years – after they happened. Obviously, reporting on what happened to gain insight is interesting and perhaps useful, but the value of predicting outcomes and “pre-acting” rather than reacting to those outcomes can be priceless.

I cannot think of a single successful business that is not continuously working to complete the value chain from the collection of data to predictive modeling and on to automating mission-critical decisions through effective prescriptive decisioning systems, i.e., some (semi-)automated system by which the best pre-actions to anticipated events and outcomes become part of the routine day-to-day operations and SOPs.

There are near infinite numbers of specific examples. I have had the privilege of collaborating with some brilliant visionaries and practitioners on several books around predictive modeling, the analysis of unstructured data, and (in a forthcoming book) on the application of these technologies to optimize healthcare in various ways. These books describe the near-infinite universe of use cases and examples to illustrate what successful businesses and government agencies are doing today.

When good projects go bad

So what are the real challenges to successfully adopting predictive and prescriptive analytics? The biggest challenge in any such project, if these technologies are to be incorporated into mission-critical processes, is to successfully complete every single step of the value chain: data collection, data storage, data preparation, predictive modeling, validated analytic reporting, and finally the decisioning support and prescriptive tools that realize value.

There are near infinite numbers of ways by which well-intended and sometimes planned projects can go off the rails. But in our experience, it almost always has to do with the difficulty of connecting to the right data at the right time, of delivering the right results to the right stakeholder within the actionable time interval where the right decision can make a difference, or of incorporating the predictions and prescriptions into an effective automated process that implements the right decisions.

Sometimes, it is an overworked IT department dealing with outdated and inadequate hardware and storage technologies, trying to manage the “prevention of IT” given these other challenges. Sometimes there are challenges integrating diverse data sources that span structured data in relational databases on premise, information that needs to be accessed in the cloud or from internet-based services, with unstructured textual information stored in distributed file systems.

For example, many manufacturing customers of StatSoft need to integrate upstream manufacturing data with final product testing data, and then link it to unstructured warranty claim narratives, stored in diverse systems, that capture failures in the field. In the financial services industry, the established “brick-and-mortar” players in particular are challenged to build the right systems to capture all customer touch points and connect them with the right prediction/prescription models, to deliver superior services when they are most needed.

So in short, the data may be there, the technologies to do useful things with those data exist (and are comprehensively available in StatSoft’s products), but the two cannot readily be connected. It is generally acknowledged that data preparation consumes about 90% or more of the effort in analytic projects.

Completing the value chain

That is why we are excited at StatSoft to be part of Dell, and why our customers almost immediately “get it”: Dell hardware, combined with the cutting edge tools and technologies in Dell’s software stack, combined with Dell’s thought leaders and effective services across different domains, and now combined with StatSoft’s tools and solutions for predictive and prescriptive analytics deliver the only ecosystem of its kind that can integrate very heterogeneous data sources, and connect them to effective predictive and prescriptive analytics. It does not matter if, as is the case in the real world, these data sources are structured or unstructured, involve multiple data storage technologies and vendors, are implemented on-prem or cloud based. We can deliver solutions based on robust hardware with cutting-edge software and effective and efficient services, combined with the right analytics capabilities to drive effective action.

So pausing for a moment to reflect on this, I cannot really think of any other provider of these capabilities that can complete the data-to-insight-and-action value chain for driving competitive advantages to all businesses small or large. StatSoft’s motto was “Making the World more Productive” which naturally goes with Dell and the Power to do more.

This will be an exciting time going forward for StatSoft and Dell, and our customers.

How to Plot Graphs on Multiple Scales

Graphing is a vital part of any data analysis project. Graphs visually reveal patterns and relationships between variables and provide invaluable information. At times, the patterns may be interesting; however, the scaling of the data can simultaneously interfere with the messages to be conveyed.
When units and scale vary greatly, seeing detail in all variables on a plot becomes quite impossible. This is when you know your multi-variable plot needs multiple, varying scales. Let’s look at our options…
Double Y Plots
Many graphing tools have a Graph type option called Double-Y. This graph type makes it possible for you to select one or more variables associated with the left Y axis and one or more variables to associate with the right Y axis. This is a simple way of creating a compound graph that shows variables with two different scales.
For example, open the STATISTICA data file, Baseball.sta, from the path C:/Program Files/StatSoft/STATISTICA 12/Examples/Datasets. Several of the variables in this example data file have very different scales.
On the Graphs tab in the Common group, click Scatterplot. In the 2D Scatterplots Startup Panel, select the Advanced tab. In the Graph type group box, select Double-Y.
STATISTICA 2D Scatterplots
Now, click the Variables button, and in the variable selection dialog box, select RUNS as X, WIN as Y Left, and DP as Y Right. Click the OK button.
Click OK in the 2D Scatterplots Startup Panel to create the plot. The resulting plot shows the two Y variables, each on its own separately determined scale.
STATISTICA Scatterplot with multiple variables
WIN shows a scale from 0.25 to 0.65. This is the season winning proportion. The variable DP is shown on a scale from 100 to 220 and is the number of double plays in the season. Because of the great difference in the scale of these two variables, a Double-Y plot is the best way to simultaneously show these variables’ relationships with the X factor, RUNS.
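To see why a single shared axis would hide the detail in WIN, it can help to work out how much of the axis height each variable would occupy. The sketch below is plain Python, not STATISTICA code; it uses only the two ranges quoted above, and the function name `axis_share` is made up for illustration.

```python
def axis_share(var_range, axis_range):
    """Fraction of the axis height that a variable's range occupies."""
    var_lo, var_hi = var_range
    axis_lo, axis_hi = axis_range
    return (var_hi - var_lo) / (axis_hi - axis_lo)

# Ranges quoted in the text above.
win = (0.25, 0.65)   # season winning proportion
dp = (100, 220)      # double plays in the season

# A single shared Y axis has to span both variables.
shared = (min(win[0], dp[0]), max(win[1], dp[1]))

win_on_shared = axis_share(win, shared)   # well under 1% of the axis height
dp_on_shared = axis_share(dp, shared)     # a little over half

# On a Double-Y plot each variable gets an axis fitted to its own range.
win_on_own = axis_share(win, win)         # the full height
dp_on_own = axis_share(dp, dp)            # the full height

print(f"WIN on shared axis: {win_on_shared:.2%}")
print(f"DP on shared axis:  {dp_on_shared:.2%}")
```

On a shared axis, all of the variation in WIN would be squeezed into a sliver at the bottom of the plot, which is the situation the Double-Y option avoids.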
Multiple Y Plots
An additional option is available for creating plots with multiple axis scales. This option is used when you need more scales than the Double-Y allows or when you need an additional axis in another place or capacity.
Continuing the same example, add a second variable, BA, to the Y Left variable list.
STATISTICA 2D Scatterplots -- adding a second variable
Click OK to create the new plot.
STATISTICA Scatterplot of multiple variables, sharing left Y axis
Now, WIN and BA share the left Y axis. BA, batting average, is on a scale of .2 to .3. Giving BA a separate Y axis scale would show more detail in the added variable. To do this, right-click in the graph, and on the shortcut menu select Graph Options. Select the Axis – General tab of the Graph Options dialog box.
From the Axis drop-down menu, select the Y left axis. Then click Add new axis. A new Y left axis is added to the plot called Y left’.
STATISTICA Graph Options, with new Y left axis added to plot
Next, the BA variable needs to be related to that axis and customized. Select the Plot – General tab to make this change.
On the Plot drop-down list, select the variable BA. Then, in the Assignment of axis group, select the Custom option button, and specify Y left’ as the custom axis.
STATISTICA Graph Options, customizing second Y left axis
Click OK to update the plot.
STATISTICA Scatterplot with 3 Y variables plotted with custom scaling

The resulting plot now has three Y variables plotted, each with its own Y axis scaling and labeling. Showing patterns and relationships in data of varying scale is made easy with multiple axes.
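Conceptually, the axis assignment in the steps above is a mapping from each variable to an axis, with each axis scaled to cover the variables assigned to it. The following Python sketch is an illustration only, not how STATISTICA implements it; the WIN, BA, and DP ranges are the approximate ones quoted in this article, and `axis_limits` is a hypothetical helper.

```python
from collections import defaultdict

# Approximate ranges quoted in the article.
RANGES = {"WIN": (0.25, 0.65), "BA": (0.2, 0.3), "DP": (100, 220)}

def axis_limits(assignment, ranges=RANGES):
    """Return {axis: (low, high)} covering every variable on that axis."""
    grouped = defaultdict(list)
    for variable, axis in assignment.items():
        grouped[axis].append(ranges[variable])
    return {
        axis: (min(lo for lo, _ in spans), max(hi for _, hi in spans))
        for axis, spans in grouped.items()
    }

# Before: WIN and BA share the left Y axis.
shared = axis_limits({"WIN": "Y left", "BA": "Y left", "DP": "Y right"})

# After: BA is moved to the added axis, Y left'.
custom = axis_limits({"WIN": "Y left", "BA": "Y left'", "DP": "Y right"})

print(shared["Y left"])    # the axis must cover both WIN and BA
print(custom["Y left'"])   # the axis covers BA only
```

With its own axis, BA fills the plot height instead of under a quarter of it.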

Heavy Equipment Manufacturing – #Statistica #StatsoftSA-R #Software #Statistics #Engineering

STATISTICA Solutions for Heavy Equipment Manufacturing


Capital Equipment Manufacturers utilize STATISTICA throughout the manufacturing process and then analyze the repair and usage data once their products are in use by customers

Manufacturing / Six Sigma

STATISTICA is an integral part of the quality control and Six Sigma programs at heavy equipment manufacturing organizations. Several of the largest global manufacturing organizations have global site licenses for STATISTICA, which is used throughout their manufacturing sites.

Applications range from Web-based monitoring of Quality Control to fairly standard statistical process control techniques to customized STATISTICA-based applications for analyses that are specific to the type of manufacturing being performed.

Warranty Analyses

Capital equipment manufacturers typically provide basic and extended warranties to their customers as a value-added service. The length of warranty to provide and its associated cost for each product are important concerns for these organizations.

It is also helpful from product improvement and repair process improvement perspectives to be able to determine the most frequent repairs by product, the factors that contribute to a failure type, and the correlations between failures (e.g., if the repair technician determines that the water pump needs to be replaced, they may as well replace another component that is also likely to fail).

STATISTICA’s data mining and text mining algorithms are critical components in the successful setting of warranty parameters and the determination of repair guidelines and rules to decrease warranty service costs.
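The water pump example above is, at its simplest, a conditional frequency: among repair visits where one component was replaced, how often was another replaced too? The Python sketch below shows that calculation on a handful of made-up repair records. The component names and records are hypothetical, and this is a simple counting illustration, not STATISTICA's data mining or text mining algorithms.

```python
from collections import Counter
from itertools import permutations

# Hypothetical repair records: the set of components replaced per visit.
repairs = [
    {"water pump", "thermostat"},
    {"water pump", "thermostat", "belt"},
    {"water pump"},
    {"belt"},
    {"thermostat"},
    {"water pump", "thermostat"},
]

def co_repair_rates(records):
    """Estimate P(b replaced | a replaced) for every ordered pair (a, b)."""
    single = Counter()   # visits in which each component was replaced
    pair = Counter()     # visits in which both a and b were replaced
    for parts in records:
        single.update(parts)
        pair.update(permutations(sorted(parts), 2))
    return {(a, b): n / single[a] for (a, b), n in pair.items()}

rates = co_repair_rates(repairs)
for (a, b), rate in sorted(rates.items(), key=lambda kv: -kv[1]):
    print(f"{b} also replaced in {rate:.0%} of visits that replaced {a}")
```

A high rate for a pair suggests a candidate repair guideline (replace both in one visit), which would then need to be weighed against the cost of the extra part.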

Remote Monitoring

As a value-added service, organizations are able to offer their customers remote monitoring: data transmission devices deployed on the products feed data to a centralized database. STATISTICA is integrated with those databases and monitors the various data feeds from the customer’s equipment. For example, the STATISTICA application includes predictive models to monitor oil pressure, RPMs, water pressure and various other equipment parameters. STATISTICA provides automated alerting and exception reporting when the latest data predict a problem or a failure for a piece of equipment. The organization notifies the customer proactively, before there is a problem, and a decision is made about whether a repair technician should be sent out to make adjustments to the machine.
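The alerting idea can be sketched with a very simple stand-in for a predictive model: fit a straight line to the most recent readings, project it a few steps ahead, and raise an alert if the projection crosses a limit. This is a Python illustration with made-up oil pressure readings and a hypothetical limit; the predictive models in an actual deployment would be more sophisticated than a linear trend.

```python
def linear_trend(values):
    """Least-squares slope and intercept for values at times 0, 1, 2, ..."""
    n = len(values)
    mean_t = (n - 1) / 2
    mean_v = sum(values) / n
    sxx = sum((t - mean_t) ** 2 for t in range(n))
    sxy = sum((t - mean_t) * (v - mean_v) for t, v in enumerate(values))
    slope = sxy / sxx
    return slope, mean_v - slope * mean_t

def predict_alert(values, lower_limit, steps_ahead):
    """Return (alert, forecast); alert is True if the forecast is below the limit."""
    slope, intercept = linear_trend(values)
    forecast = intercept + slope * (len(values) - 1 + steps_ahead)
    return forecast < lower_limit, forecast

# Hypothetical oil pressure readings (psi): still above the limit, but falling.
readings = [42.0, 41.5, 41.0, 40.5, 40.0, 39.5]
alert, forecast = predict_alert(readings, lower_limit=38.0, steps_ahead=5)
print(alert, round(forecast, 2))
```

Every reading here is still above the limit, so a plain threshold check would stay silent; the alert comes from the projected value, which is what allows the customer to be notified before the problem occurs.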

Sales Analysis / CRM

StatSoft’s customers in the Capital Equipment Industry use the broad base of analytic techniques in the platform to determine regional patterns in their sales and to make cross-selling and up-selling recommendations based upon what an individual customer just purchased, what they already own, the business that the customer is in, the region in which the customer is based, etc.

Data Integration – Connectivity and Data Integration Solutions #dataintegration #statistica

StatSoft’s customers depend on STATISTICA Enterprise for business intelligence and data mining applications that use a wide variety of data from many different sources.

STATISTICA Enterprise allows you to maintain the connections to this data, and the analyses of the data, all in one centralized application.

The data sources supported by STATISTICA Enterprise include the following:

  • Any OLE DB or ODBC relational database, such as Oracle, SQL Server, or Access
  • Flat files like Excel spreadsheets and STATISTICA spreadsheets
  • Data historian repositories such as the PI Data Historian from OSIsoft, Inc.
  • Multidimensional databases accessed through In-Place Database Processing (IDP), which lets you query databases containing terabytes of data and process the data without importing it to local storage

STATISTICA Enterprise also provides convenient tools for filtering your data, and for viewing the metadata associated with those data sources.


The STATISTICA Enterprise administration application (called the Enterprise Manager) allows you to specify multiple users, with passwords, and assign different roles to those different users. For example, your database administrators can be given permission to create and modify database connections and queries, and your engineers can be given permission to run those queries and analyze the resulting data. Furthermore, each engineer can be assigned permissions for only the analyses that pertain to his or her work.
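The role model described above can be pictured as a mapping from roles to permitted actions, plus a per-user list of analyses. The Python sketch below is a generic illustration of that idea; the role names, action names, and users are hypothetical and do not reflect the Enterprise Manager's actual configuration or API.

```python
from dataclasses import dataclass, field

# Hypothetical roles and the actions each one permits.
ROLE_ACTIONS = {
    "database_admin": {"create_connection", "modify_connection", "modify_query"},
    "engineer": {"run_query", "run_analysis"},
}

@dataclass
class User:
    name: str
    roles: set
    analyses: set = field(default_factory=set)  # analyses this user may run

def is_allowed(user, action, analysis=None):
    """True if a role permits the action and, when given, the analysis is assigned."""
    permitted = any(action in ROLE_ACTIONS.get(role, set()) for role in user.roles)
    if permitted and analysis is not None:
        return analysis in user.analyses
    return permitted

dba = User("dana", {"database_admin"})
eng = User("eli", {"engineer"}, analyses={"line_3_spc"})

print(is_allowed(dba, "modify_query"))                 # True
print(is_allowed(eng, "modify_query"))                 # False
print(is_allowed(eng, "run_analysis", "line_3_spc"))   # True
print(is_allowed(eng, "run_analysis", "line_7_spc"))   # False
```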


Business Intelligence – Solve a Critical Quality Problem

This is a continuation of Predictive Analytics – Solve a Critical Quality Problem. A BioPharmaceutical Manufacturing company was scrapping about 30% of batches, which is very expensive. The company’s engineers tried to solve the problem with various techniques.

But it was not until they started using predictive analytics (also known as data mining) that they uncovered actionable process improvements. These improvements are predicted to lower the scrap rate from around 30% to 5%.

How were these improvements discovered?

The Data Mining Approach for Root Cause Analysis: Data mining is a broad term used in a variety of ways, in addition to other terms such as “predictive modeling” or “advanced analytics.”

Here, it means the application of the latest data-driven analytics to build models of a phenomenon, such as a manufacturing process, based on historical data. In a nutshell, in the last 10-15 years, there has been a great leap forward in terms of the flexibility and ease of building models and the amount of data that can be utilized efficiently due to advances in computing hardware.

Data mining has changed the world of analytics

… in a good way.

… forever.

Companies that embrace these changes and learn to apply them will benefit.

Data mining begins with the definition and aggregation of the relevant data. In this case, it was the last 12 months of all the data from the manufacturing process, including:

  • raw materials characteristics
  • process parameters across the unit operation for each batch
  • product quality outcomes on the critical-to-quality responses on which they based their judgment about whether to release the batch or scrap it

Once the relevant data were gathered, StatSoft consultants sat down with the engineering team before we began the model building process. This is a critical step and one that you should consider as you adopt data mining.

We asked the engineers questions such as:

  • Which factors can you control, and which ones can you not control?
  • Which factors are easy to control, and which ones are difficult or expensive to control?

The rationale is that data mining is not an academic exercise when applied to manufacturing. It is being done to improve the process, and that requires action as the end result. A model that is accurate but based solely on parameters that are impossible or expensive to tweak is impractical (which is a nice way of saying “useless”).

With this information in hand, model building is the next step in the data mining process. In short, many data mining model types are applied to the data to determine which one results in the optimal goodness of fit, such as the smallest residuals between predicted and actual values.

Various methods are employed to ensure that the best models are selected. For example, a random hold-out sample of the historical data is used for each model to make predictions. This helps protect against the potential for the model to get very good at predicting one set of historical data to the point at which it is really bad at predicting the outcomes for other batches.
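A hold-out comparison can be sketched in a few lines of Python. The example below uses synthetic data and two deliberately simple candidate models (predict the mean, and a one-variable least-squares line); both are stand-ins chosen for brevity, not the model types used in the project, and all names are illustrative.

```python
import random

def fit_mean(train):
    """Baseline model: always predict the training mean of y."""
    mean_y = sum(y for _, y in train) / len(train)
    return lambda x: mean_y

def fit_line(train):
    """One-variable least-squares line."""
    n = len(train)
    mean_x = sum(x for x, _ in train) / n
    mean_y = sum(y for _, y in train) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in train)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in train)
    slope = sxy / sxx
    return lambda x: mean_y + slope * (x - mean_x)

def mse(model, rows):
    """Mean squared residual between predicted and actual values."""
    return sum((model(x) - y) ** 2 for x, y in rows) / len(rows)

rng = random.Random(0)  # fixed seed so the split is repeatable

# Synthetic "batches": y depends linearly on x, plus noise.
data = []
for _ in range(200):
    x = rng.uniform(0, 10)
    data.append((x, 2.0 * x + rng.gauss(0, 1)))

# Random hold-out: fit on 70% of the rows, score on the unseen 30%.
rng.shuffle(data)
cut = int(0.7 * len(data))
train, holdout = data[:cut], data[cut:]

scores = {name: mse(fit(train), holdout)
          for name, fit in [("mean", fit_mean), ("line", fit_line)]}
best = min(scores, key=scores.get)
print(scores, best)
```

Because the scores come from rows the models never saw during fitting, a model that merely memorized the training batches would not be rewarded for it.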

A major advantage of data mining is that you don’t need to make assumptions ahead of time about the nature of the data and the nature of the relationships between the predictors and the responses. Traditional least squares linear modeling, such as what is taught in Six Sigma classes on the analytic tools, does require this knowledge.

For Root Cause Analysis, most data mining techniques provide importance plots or similar ways to see very quickly which raw materials and process parameters are the major predictors of the outcomes, and, as valuable, which factors don’t matter.

root cause analysis BI
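As a rough idea of what an importance ranking computes, the Python sketch below scores each predictor by the absolute value of its correlation with the outcome and sorts the result. This is a crude linear stand-in: the importance measures in data mining tools are typically model-based and can capture nonlinear effects and interactions. The parameter names and data are synthetic.

```python
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

rng = random.Random(1)  # fixed seed for repeatability
n = 300

# Synthetic predictors with hypothetical names.
predictors = {
    "raw_material_purity": [rng.gauss(0, 1) for _ in range(n)],
    "reactor_temperature": [rng.gauss(0, 1) for _ in range(n)],
    "mixing_time": [rng.gauss(0, 1) for _ in range(n)],
}

# Synthetic outcome: driven strongly by temperature, moderately by purity,
# and not at all by mixing time.
quality = [
    3.0 * temp + 1.5 * purity + rng.gauss(0, 1)
    for temp, purity in zip(predictors["reactor_temperature"],
                            predictors["raw_material_purity"])
]

importance = {name: abs(pearson(values, quality))
              for name, values in predictors.items()}
ranking = sorted(importance, key=importance.get, reverse=True)

for name in ranking:
    print(f"{name:22s} {importance[name]:.2f}")
```

The bottom of the ranking is as informative as the top: a factor that scores near zero is one the team can stop spending effort on.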

At this point in the data mining process, StatSoft consultants sat down with the engineering team to review the most important parameters. Typically, there is an active discussion with comments from the engineers such as:

  • that can’t be
  • I don’t see how that parameter would be relevant

The conversation gradually transforms over the course of an hour to:

  • I could see how those parameters could interact with the ones later in the process to impact the product quality

Data mining methods are really good at modeling large amounts of data from lots of parameters, a typical situation in manufacturing. Humans are good at thinking about a few factors at a time and interpreting a limited time window of data.

As shown above, the two approaches complement each other, with the results from data mining as important insights about the manufacturing process that can then be evaluated, validated, and utilized by the engineering team to determine:

  • Now, what do we do to improve the process? What are the priorities?

The company then planned to implement process improvements that are predicted to lower the scrap rate of batches from ~30% to ~5%!

Note: To get from root cause analysis to process improvements, the models were used for optimization (another data mining technique).