Blog Archives
Completing the value chain: data, insight, action
Thomas Hill, Ph.D. Dell Contributor at Tech Page One
The value of effective predictive and prescriptive analytics is easily explained: the best and largest storage capabilities, the fastest data access and ETL functionality, and the most robust hardware infrastructure will not guarantee success in a highly competitive marketplace. If, however, one can predict what will happen next – how consumer sentiment will shift, which large insurance claim presents an opportunity for subrogation, or how specific changes in the manufacturing process will drastically reduce warranty claims in the field – critical actions can be taken that yield competitive advantages, often paying back the entire investment required to achieve those insights within weeks or even days.
I sometimes like to point out that I have predicted every stock market crash in the past 30 years – after they happened. Obviously, reporting on what happened to gain insight is interesting and perhaps useful, but the value of predicting outcomes and “pre-acting” rather than reacting to those outcomes can be priceless.
I cannot think of a single successful business that is not continuously working to complete the value chain from data collection to predictive modeling, and to automate mission-critical decisions through effective prescriptive decisioning systems – that is, some (semi-)automated system by which the best pre-actions to anticipated events and outcomes become part of routine day-to-day operations and SOPs.
There is a near-infinite number of specific examples. I have had the privilege of collaborating with some brilliant visionaries and practitioners on several books on predictive modeling, the analysis of unstructured data, and (in a forthcoming book) the application of these technologies to optimize healthcare in various ways. These books draw on that near-infinite universe of use cases and examples to illustrate what successful businesses and government agencies are doing today.
When good projects go bad
So what are the real challenges to successfully adopting predictive and prescriptive analytics? The biggest challenge in any such project – in order to incorporate these technologies into mission-critical processes – is to successfully complete every single step of the value chain: from data collection to data storage, data preparation, predictive modeling, and validated analytic reporting, through to the decisioning support and prescriptive tools that realize the value.
There are countless ways in which well-intended (and sometimes even well-planned) projects can drive off the rails. But in our experience, it almost always comes down to the difficulty of connecting to the right data at the right time, of delivering the right results to the right stakeholder within the actionable time interval where the right decision can make a difference, or of incorporating the predictions and prescriptions into an effective automated process that implements the right decisions.
Sometimes, it is an overworked IT department dealing with outdated and inadequate hardware and storage technologies, trying to manage the “prevention of IT” given these other challenges. Sometimes there are challenges in integrating diverse data sources: structured data in on-premise relational databases, information that must be accessed in the cloud or from internet-based services, and unstructured textual information stored in distributed file systems.
For example, many of StatSoft’s manufacturing customers need to integrate upstream manufacturing data with final product testing data, and then link it to unstructured warranty-claim narratives, stored in diverse systems, that capture failures in the field. In the financial services industry, the established “brick-and-mortar” players in particular are challenged to build the right systems to capture all customer touch points and connect them with the right prediction/prescription models, to deliver superior services when they are most needed.
So in short, the data may be there, the technologies to do useful things with those data exist (and are comprehensively available in StatSoft’s products), but the two cannot readily be connected. It is generally acknowledged that data preparation consumes about 90% or more of the effort in analytic projects.
Completing the value chain
That is why we at StatSoft are excited to be part of Dell, and why our customers almost immediately “get it”: Dell hardware, combined with the cutting-edge tools and technologies in Dell’s software stack, Dell’s thought leaders and effective services across different domains, and now StatSoft’s tools and solutions for predictive and prescriptive analytics, delivers the only ecosystem of its kind that can integrate very heterogeneous data sources and connect them to effective predictive and prescriptive analytics. It does not matter whether – as is the case in the real world – these data sources are structured or unstructured, involve multiple data storage technologies and vendors, or are implemented on-premise or in the cloud. We can deliver solutions based on robust hardware, cutting-edge software, and effective and efficient services, combined with the right analytics capabilities to drive effective action.
Pausing for a moment to reflect on this, I cannot think of any other provider that can complete the data-to-insight-to-action value chain to drive competitive advantage for businesses large or small. StatSoft’s motto was “Making the World More Productive,” which naturally complements Dell and the Power to Do More.
This will be an exciting time going forward for StatSoft and Dell, and our customers.
A New Gold Rush Is On. Who Will Strike It Rich?
Original article by Michael Dell of Dell
Data is arguably the most important natural resource of this century. Top thinkers are no longer the people who can tell you what happened in the past, but those who can predict the future. Welcome to the Data Economy, which holds the promise for significantly advancing society and economic growth on a global scale.
Big data is big news just about everywhere you go these days. Here in Texas, everything’s big, so we just call it data. But we’re all talking about the same thing—the universe of structured data, like transactional information in databases, and also the unstructured data, like social media, that exists in its natural form in the real world.
Organizations of all sizes are trying to figure out how to use all of this data to deliver a better customer experience and build new business models. Consumers are struggling to balance a desire for automated, personalized services with the need for safety. Governments are pressured to use all available data in support of national security, but not at the expense of citizens’ right to privacy. And underlying it all is the realization that data, if managed, secured and leveraged properly, is the pathway to progress and economic success.
So who will strike it rich in this new, data-driven gold rush? It will invariably be those who are willing to accept the new realities of the Data Economy. Business instincts and intuition are being augmented, and increasingly replaced, by data analysis as the drivers of success. We’ve seen it at Dell. Our marketing team uncovered more than $310 million in additional revenue last year through the use of advanced analytics. This year, we expect that number to exceed half a billion.
We believe that’s just the tip of the iceberg, and we’re accelerating our strategy. Recently we announced the acquisition of StatSoft, a leading provider of data mining, predictive analytics and data visualization solutions. It is yet another investment in our enterprise solutions, software and services portfolio specifically designed to help our customers turn data into action.
But contrary to popular opinion, the data economy isn’t just for global enterprises like Dell. A Dell-commissioned study that we will announce later this month found that mid-market companies are increasingly investing in data projects to drive better decision making and better business results. We have also found that startups that use technology more effectively create twice as many jobs on average and are more productive and profitable than companies that don’t.
At their core, entrepreneurs are all about solving problems, and nothing provides a better window into problems than data. Consider the popularity of Global Positioning System (GPS) technologies. The simple act of connecting to and delivering data paved the way for many successful businesses that in turn created an entirely new segment of the economy.
The day is near when the use of data analytics will simply become the price of remaining viable and competitive in the global marketplace. There is still a lot of uncertainty about the Data Economy, but this much is clear: the opportunity for data-driven organizations is golden.
Independent Report Labels StatSoft as “Double Victor” Based on Vision, Viability, and Value
StatSoft Recognized by Analysts as Industry Leader in Predictive Analytics
Hurwitz’s Victory Index Labels StatSoft as “Double Victor”
Organizations are adopting, integrating and utilizing predictive analytics at an incredible rate. The business value of predictive analytics is clear: it enables organizations to define and attract the most profitable customers, streamline their resourcing and supply chain, and improve the quality and targeting of their products, among many other applications. The Victory Index for predictive analytics, developed by Hurwitz & Associates, is designed to help organizations with an analysis of vendors and solutions for predictive analytics software. Hurwitz labeled StatSoft a “Double Victor” based on its strong presence in the market, a solid vision, impeccable customer service, and great value for a lower total cost of ownership.
The Victory Index is a valuable tool that companies can use to better understand predictive analytics and how they can become key players in a highly competitive market. The report shows where each of the leading vendors falls within the designated categories, so that companies can draw on the experts and their Index rankings in the field of predictive analytics.
Click here to view the full report.
HIV/AIDS Statistics in South Africa
StatSoft Southern Africa Research supplies STATISTICA software to the Department of Health and the entire Wits Health Faculty, which includes the MRC (Medical Research Council), which also specialises in collecting data for HIV/AIDS-related research.
Below is a detailed technical report containing analytics and the HIV/AIDS forecast for South Africa.
In his presentation last week, Minister of Health Dr Aaron Motsoaledi presented a troubling picture of the HIV epidemic, showing our failure as a country to turn the tide against this unrelenting virus. Quoting data from STATS SA, the Department of Health, the Department of Home Affairs, the World Health Organization, the Medical Research Council, the Human Sciences Research Council (HSRC), and a series of articles on health in South Africa published in the prestigious medical journal The Lancet, he described how HIV has spread to every corner of our country, has impacted TB and the health services, and has dramatically shortened life expectancy. The central point was that an unacceptably large number of South Africans have not been able to realise their dream of a better life in a free and democratic South Africa due to the scourge of AIDS, and that this situation cannot be allowed to continue any longer.
To see the rest of the full detailed technical report, please click here
Original Source: http://www.health-e.org.za
Data Mining & StatSoft Power Solutions
Analytics supported by third-party report
The non-profit Electric Power Research Institute (EPRI) recently conducted a study of StatSoft technology to determine its suitability for optimizing the performance (heat rate, emissions, LOI) of an older coal-fired power plant. EPRI commissioned an optimization project from StatSoft, to be conducted under the scrutiny of EPRI’s inspectors.
Using nine months’ worth of detailed 6-minute-interval data describing more than 140 parameters of the process, EPRI found that process data analysis using STATISTICA is a cost-effective solution for optimizing the use of existing process hardware to save cost and reduce emissions.
Overview of the Approach
StatSoft Power Solutions offers solution packages designed for utility companies to optimize power plant performance, increase efficiency, and reduce emissions. Based on over 20 years of experience applying advanced data-driven, data mining optimization technologies to process optimization in various industries, these solutions allow power plants to get the most out of their equipment and control systems by leveraging all the data collected at your site to identify opportunities for improvement – even for older designs such as coal-fired Cyclone furnaces (as well as wall-fired or T-fired designs).
Opportunities for Data-Driven Strategies to Improve Power Plant Performance
Many (most) power generation facilities collect “lots of data” into dedicated historical process databases (such as OSI PI). However, in most cases only simple charts and after-the-fact ad-hoc analyses are performed on a small subset of those data; most of the information is simply not used.
For example, for coal fired power plants, our solutions can help you identify optimum settings for stoichiometric ratio, primary/tertiary air flows, secondary air biases, distribution of OFA (overfired air), burner tilts and yaw positions, and other controllable parameters to reduce NOx, CO, and LOI, without requiring any re-engineering of existing hardware.
What is Data Mining? Why Data Mining?
Data Mining is the term used to describe the application of various machine learning and/or pattern recognition algorithms and techniques to identify complex relationships among observed variables. These techniques can reveal invaluable insights when the data contain meaningful information that is “hidden” deep inside the data set and cannot be identified with simple methods. Advanced data mining can reveal those insights by processing many variables, and the complex interrelations between them, all at the same time.
Unlike CFD, data mining allows you to model the “real world” from “real data,” describing your specific plant. Using this approach, you can:
- Identify from among hundreds or even thousands of input parameters those that are critical for low-emissions efficient operations
- Determine the ranges for those parameters, and combinations of parameter ranges that will result in robust and stable low-emissions operations, without costly excursions (high-emissions events, unscheduled maintenance and expensive generation roll-backs).
These results can be implemented using your existing closed- or open-loop control system to achieve sustained improvements in power plant performance, or you can use StatSoft MultiStream to create a state-of-the-art advanced process monitoring system to achieve permanent improvements.
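As a hedged illustration of the screening idea in the bullets above (synthetic data and parameter names; this is not StatSoft’s actual algorithm), ranking candidate process parameters by their association with an emissions target takes only a few lines of NumPy:

```python
import numpy as np

rng = np.random.default_rng(0)
n_obs, n_params = 500, 50                     # e.g., 50 candidate controllables
X = rng.normal(size=(n_obs, n_params))        # synthetic process measurements
nox = 3.0 * X[:, 5] + 0.1 * rng.normal(size=n_obs)  # parameter 5 drives NOx here

# Rank parameters by absolute correlation with the target variable.
corrs = np.array([abs(np.corrcoef(X[:, j], nox)[0, 1]) for j in range(n_params)])
ranking = np.argsort(corrs)[::-1]
print("Most influential parameter:", ranking[0])  # parameter 5 ranks first
```

Real screening would of course use richer measures (nonlinear dependence, interactions), but the workflow is the same: many candidates in, a short list of drivers out.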
How is this Different from “Neural Nets” for Closed Loop Control?
One frequently asked question is: How do these solutions differ from neural networks based computer programs that can control critical power plant operations in a closed loop system (an approach used at some plants, often with less than expected success)?
The answer is that, because those systems are based on relatively simple, traditional neural networks technology which typically can only simultaneously process relatively few parameters, they are not capable of identifying the important parameters from among hundreds of possible candidates, and they will not identify specific combinations of parameter ranges (“sweet spots”) that make overall power plant operations more robust.
The cutting-edge technologies developed by StatSoft Power Solutions will not simply implement a cookie-cutter approach to use a few parameters common to all power plants to achieve some (usually only very modest) overall process performance improvements. Instead, our approach allows you to take a fresh look at all your data and operations, to optimize them for best performance. This will allow you to focus your process monitoring efforts, operator training, or automation initiatives only on those parameters that actually drive boiler efficiency, emissions, and so on at your plant and for your equipment.
What we are offering is not simply another neural net for closed loop control; instead, it provides flexible tools based on cutting-edge data processing technologies to optimize all systems, and also provides smart monitoring and advisory options capable of predicting problems, such as emissions related to combustion optimization or maintenance issues.
Contact StatSoft Southern Africa for more information about our services, software solutions, and recent success stories. lorraine@statsoft.co.za
Generalized Additive Models (GAM)
The methods available in Generalized Additive Models are implementations of techniques developed and popularized by Hastie and Tibshirani (1990). A detailed description of these and related techniques, the algorithms used to fit these models, and discussions of recent research in this area of statistical modeling can also be found in Schimek (2000).
Additive Models
The methods described in this section represent a generalization of multiple regression (which is a special case of general linear models). Specifically, in linear regression, a linear least-squares fit is computed for a set of predictor or X variables, to predict a dependent Y variable. The well known linear regression equation with m predictors, to predict a dependent variable Y, can be stated as:
Y = b0 + b1*X1 + … + bm*Xm
where Y stands for the (predicted values of the) dependent variable, X1 through Xm represent the m values for the predictor variables, and b0 and b1 through bm are the regression coefficients estimated by multiple regression. A generalization of the multiple regression model would be to maintain the additive nature of the model, but to replace the simple terms of the linear equation bi*Xi with fi(Xi), where fi is a non-parametric function of the predictor Xi. In other words, instead of a single coefficient for each variable (additive term) in the model, in additive models an unspecified (non-parametric) function is estimated for each predictor, to achieve the best prediction of the dependent variable values.
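As a minimal sketch of this linear special case (synthetic data; the coefficient values are illustrative, not from the text), the coefficients b0 through bm can be recovered by ordinary least squares:

```python
import numpy as np

# Fit Y = b0 + b1*X1 + b2*X2 by least squares, using only NumPy.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
y = 1.0 + 2.0 * X[:, 0] - 0.5 * X[:, 1] + 0.05 * rng.normal(size=200)

design = np.column_stack([np.ones(len(y)), X])  # prepend the intercept column
b, *_ = np.linalg.lstsq(design, y, rcond=None)  # b = [b0, b1, b2]
print(b)  # approximately [1.0, 2.0, -0.5]
```

An additive model keeps the same additive structure but replaces each product bi*Xi with a smooth function fi(Xi) estimated from the data.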
Generalized Linear Models
To summarize the basic idea, the generalized linear model differs from the general linear model (of which multiple regression is a special case) in two major respects: First, the distribution of the dependent or response variable can be (explicitly) non-normal, and does not have to be continuous, e.g., it can be binomial; second, the dependent variable values are predicted from a linear combination of predictor variables, which are “connected” to the dependent variable via a link function. The general linear model for a single dependent variable can be considered a special case of the generalized linear model: In the general linear model the dependent variable values are expected to follow the normal distribution, and the link function is a simple identity function (i.e., the linear combination of values for the predictor variables is not transformed).
To illustrate, in the general linear model a response variable Y is linearly associated with values on the X variables while the relationship in the generalized linear model is assumed to be
Y = g(b0 + b1*X1 + … + bm*Xm)
where g(…) is a function. Formally, the inverse function of g(…), say gi(…), is called the link function; so that:
gi(muY) = b0 + b1*X1 + … + bm*Xm
where muY stands for the expected value of Y.
Distributions and Link Functions
Generalized Additive Models allows you to choose from a wide variety of distributions for the dependent variable, and link functions for the effects of the predictor variables on the dependent variable (see McCullagh and Nelder, 1989; Hastie and Tibshirani, 1990; see also GLZ Introductory Overview – Computational Approach for a discussion of link functions and distributions):
Normal, Gamma, and Poisson distributions:
Log link: f(z) = log(z)
Inverse link: f(z) = 1/z
Identity link: f(z) = z
Binomial distributions:
Logit link: f(z)=log(z/(1-z))
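The link functions listed above, together with their inverses, can be written down directly; a small plain-Python sketch (not tied to any particular package):

```python
import math

# Each entry maps a link name to a (link, inverse link) pair.
links = {
    "log":      (lambda z: math.log(z),           lambda e: math.exp(e)),
    "inverse":  (lambda z: 1.0 / z,               lambda e: 1.0 / e),
    "identity": (lambda z: z,                     lambda e: e),
    "logit":    (lambda z: math.log(z / (1 - z)), lambda e: 1.0 / (1.0 + math.exp(-e))),
}

# Sanity check: each inverse undoes its link on a value in (0, 1).
for name, (g, g_inv) in links.items():
    assert abs(g_inv(g(0.3)) - 0.3) < 1e-12, name
```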
Generalized Additive Models
We can combine the notion of additive models with generalized linear models to derive the notion of generalized additive models:
gi(muY) = b0 + f1(X1) + … + fm(Xm)
In other words, the purpose of generalized additive models is to maximize the quality of prediction of a dependent variable Y from various distributions, by estimating unspecified (non-parametric) functions of the predictor variables, which are “connected” to the dependent variable via a link function.
Estimating the Nonparametric Function of Predictors via Scatterplot Smoothers
A unique aspect of generalized additive models is the non-parametric functions fi of the predictor variables Xi. Specifically, instead of some kind of simple or complex parametric function, Hastie and Tibshirani (1990) discuss various general scatterplot smoothers that can be applied to the X variable values, with the target criterion of maximizing the quality of prediction of the (transformed) Y variable values. One such scatterplot smoother is the cubic smoothing splines smoother, which generally produces a smooth representation of the relationship between the two variables in the scatterplot. Computational details regarding this smoother can be found in Hastie and Tibshirani (1990; see also Schimek, 2000).
To summarize, instead of estimating single parameters (like the regression weights in multiple regression), in generalized additive models we find a general unspecified (non-parametric) function that relates the predicted (transformed) Y values to the predictor values.
A Specific Example: The Generalized Additive Logistic Model
Let us consider a specific example of the generalized additive models: A generalization of the logistic (logit) model for binary dependent variable values. As also described in detail in the context of Nonlinear Estimation and Generalized Linear/Nonlinear Models, the logistic regression model for binary responses can be written as follows:
y=exp(b0+b1*x1+…+bm*xm)/{1+exp(b0+b1*x1+…+bm*xm)}
Note that the distribution of the dependent variable is assumed to be binomial, i.e., the response variable can only assume the values 0 or 1 (e.g., in a market research study, the purchasing decision would be binomial: The customer either did or did not make a particular purchase). We can apply the logistic link function to the probability p (ranging between 0 and 1) so that:
p’ = log {p/(1-p)}
By applying the logistic link function, we can now rewrite the model as:
p’ = b0 + b1*X1 + … + bm*Xm
Finally, we substitute the simple single-parameter additive terms to derive the generalized additive logistic model:
p’ = b0 + f1(X1) + … + fm(Xm)
An example application of this model can be found in Hastie and Tibshirani (1990).
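A small numerical sketch of this model, with hypothetical smooth terms f1 and f2 standing in for the non-parametric functions that a fitting procedure would actually estimate from data:

```python
import math

# Hypothetical smooth terms (for illustration only; the real fi are
# estimated from the data, not chosen by hand).
f1 = lambda x: math.sin(x)
f2 = lambda x: 0.5 * x ** 2
b0 = -1.0

def predict_probability(x1, x2):
    p_prime = b0 + f1(x1) + f2(x2)           # additive predictor on the logit scale
    return 1.0 / (1.0 + math.exp(-p_prime))  # inverse logit back to a probability

p = predict_probability(0.5, 1.2)
assert 0.0 < p < 1.0  # always a valid probability, whatever f1 and f2 return
```

Whatever shape the additive terms take, the inverse logit guarantees that predictions stay in the (0, 1) range required for a binomial response.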
Fitting Generalized Additive Models
Detailed descriptions of how generalized additive models are fit to data can be found in Hastie and Tibshirani (1990), as well as Schimek (2000, p. 300). In general, there are two separate iterative operations involved in the algorithm, usually labeled the outer and inner loop. The purpose of the outer loop is to maximize the overall fit of the model by maximizing the likelihood of the data given the model (similar to the maximum likelihood estimation procedures described, for example, in the context of Nonlinear Estimation). The purpose of the inner loop is to refine the scatterplot smoother, which is the cubic splines smoother. The smoothing is performed with respect to the partial residuals; i.e., for every predictor k, the weighted cubic spline fit is found that best represents the relationship between variable k and the (partial) residuals computed by removing the effect of all other j predictors (j ≠ k). The iterative estimation procedure terminates when the likelihood of the data given the model cannot be improved.
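The inner loop can be sketched in a few lines for the Gaussian/identity case (where the outer loop is trivial), substituting a crude running-mean smoother for the weighted cubic spline; this illustrates the structure of backfitting against partial residuals, not the production algorithm:

```python
import numpy as np

def running_mean_smoother(x, r, window=31):
    """Smooth partial residuals r against predictor x (a crude stand-in
    for the cubic smoothing spline used by the real algorithm)."""
    order = np.argsort(x)
    smoothed = np.convolve(r[order], np.ones(window) / window, mode="same")
    out = np.empty_like(smoothed)
    out[order] = smoothed              # map back to the original row order
    return out

def backfit(X, y, n_iter=20):
    """Cycle over predictors, re-smoothing each against the partial
    residuals formed by removing the effect of all other predictors."""
    n, m = X.shape
    alpha = y.mean()
    f = np.zeros((n, m))
    for _ in range(n_iter):
        for k in range(m):
            partial = y - alpha - f[:, np.arange(m) != k].sum(axis=1)
            f[:, k] = running_mean_smoother(X[:, k], partial)
            f[:, k] -= f[:, k].mean()  # keep each additive term centered
    return alpha + f.sum(axis=1)

rng = np.random.default_rng(2)
X = rng.uniform(-2, 2, size=(400, 2))
y = np.sin(X[:, 0]) + X[:, 1] ** 2 + 0.1 * rng.normal(size=400)

fitted = backfit(X, y)
rss = ((y - fitted) ** 2).sum()
rss_mean_only = ((y - y.mean()) ** 2).sum()
assert rss < rss_mean_only  # the additive fit beats the constant model
```

Each pass of the inner loop smooths one predictor against the residuals left over by the others, exactly the partial-residual idea described above.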
Interpreting the Results
Many of the standard results statistics computed by Generalized Additive Models are similar to those customarily reported by linear or nonlinear model fitting procedures. For example, predicted and residual values for the final model can be computed, and various graphs of the residuals can be displayed to help the user identify possible outliers, etc. Refer also to the description of the residual statistics computed by Generalized Linear/Nonlinear Models for details.
The main result of interest, of course, is how the predictors are related to the dependent variable. Scatterplots can be computed showing the smoothed predictor variable values plotted against the partial residuals, i.e., the residuals after removing the effect of all other predictor variables.
This plot allows you to evaluate the nature of the relationship between the predictor with the residualized (adjusted) dependent variable values (see Hastie & Tibshirani, 1990; in particular formula 6.3), and hence the nature of the influence of the respective predictor in the overall model.
Degrees of Freedom
To reiterate, the generalized additive models approach replaces the simple products of (estimated) parameter values times the predictor values with a cubic spline smoother for each predictor. When estimating a single parameter value, we lose one degree of freedom, i.e., we add one degree of freedom to the overall model. It is not clear how many degrees of freedom are lost due to estimating the cubic spline smoother for each variable. Intuitively, a smoother can either be very smooth, not following the pattern of data in the scatterplot very closely, or it can be less smooth, following the pattern of the data more closely. In the most extreme case, a simple line would be very smooth, and require us to estimate a single slope parameter, i.e., we would use one degree of freedom to fit the smoother (simple straight line); on the other hand, we could force a very “non-smooth” line to connect each actual data point, in which case we could “use-up” approximately as many degrees of freedom as there are points in the plot. Generalized Additive Models allows you to specify the degrees of freedom for the cubic spline smoother; the fewer degrees of freedom you specify, the smoother is the cubic spline fit to the partial residuals, and typically, the worse is the overall fit of the model. The issue of degrees of freedom for smoothers is discussed in detail in Hastie and Tibshirani (1990).
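For any linear smoother, where the fitted values are S @ y for some smoother matrix S, the effective degrees of freedom can be measured as trace(S). A sketch with a simple running-mean smoother matrix (an illustrative stand-in for the cubic spline) shows that a rougher smoother spends more degrees of freedom:

```python
import numpy as np

def running_mean_matrix(n, window):
    """Smoother matrix for a centered running mean: row i averages the
    observations whose positions fall within the window around i."""
    S = np.zeros((n, n))
    half = window // 2
    for i in range(n):
        lo, hi = max(0, i - half), min(n, i + half + 1)
        S[i, lo:hi] = 1.0 / (hi - lo)
    return S

n = 100
df_rough = np.trace(running_mean_matrix(n, 3))    # narrow window: wiggly fit
df_smooth = np.trace(running_mean_matrix(n, 31))  # wide window: smooth fit
print(df_rough, df_smooth)
assert df_rough > df_smooth > 1.0  # rougher smoother uses more df
```

At the extremes, a window covering all n points reproduces the single-df constant fit, while a window of 1 interpolates the data and uses roughly n degrees of freedom, matching the intuition in the paragraph above.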
A Word of Caution
Generalized additive models are very flexible, and can provide an excellent fit in the presence of nonlinear relationships and significant noise in the predictor variables. However, note that because of this flexibility, you must be extra cautious not to over-fit the data, i.e., apply an overly complex model (with many degrees of freedom) to data so as to produce a good fit that likely will not replicate in subsequent validation studies. Also, compare the quality of the fit obtained from Generalized Additive Models to the fit obtained via Generalized Linear/Nonlinear Models. In other words, evaluate whether the added complexity (generality) of generalized additive models (regression smoothers) is necessary in order to obtain a satisfactory fit to the data. Often, this is not the case, and given a comparable fit of the models, the simpler generalized linear model is preferable to the more complex generalized additive model. These issues are discussed in greater detail in Hastie and Tibshirani (1990).
Another issue to keep in mind pertains to the interpretability of results obtained from (generalized) linear models vs. generalized additive models. Linear models are easily understood, summarized, and communicated to others (e.g., in technical reports). Moreover, parameter estimates can be used to predict or classify new cases in a simple and straightforward manner. Generalized additive models are not easily interpreted, in particular when they involve complex nonlinear effects of some or all of the predictor variables (and, of course, it is in those instances where generalized additive models may yield a better fit than generalized linear models). To reiterate, it is usually preferable to rely on a simple well understood model for predicting future cases, than on a complex model that is difficult to interpret and summarize.
The 13th Annual KDnuggets™ Software Poll – STATISTICA
“For the first time, the number of users of free/open source software exceeded the number of users of commercial software. The usage of Big Data software grew five-fold. R, Excel, and RapidMiner were the most popular tools, with StatSoft STATISTICA getting the top commercial tool spot.” – KDnuggets.com
The 13th Annual KDnuggets™ Software Poll asked: What Analytics, Data Mining, or Big Data software have you used in the past 12 months for a real project (not just evaluation)?
This May 2012 poll attracted “a very large number of participants and used email verification” to ensure one vote per respondent. Once again, StatSoft’s STATISTICA received very high marks, earning “top commercial tool” in this poll.
Complete poll results and analysis can be found at http://www.kdnuggets.com/2012/05/top-analytics-data-mining-big-data-software.html.
KDnuggets.com is a data mining portal and newsletter publisher for the data mining community with more than 12,000 subscribers.
Poll Results: StatSoft STATISTICA becomes the most popular commercial tool!
About 28% used commercial software but not free software, 30% used free software but not commercial, and 41% used both.
The usage of big data tools grew five-fold: 15% used them in 2012, vs about 3% in 2011.
R, Excel, and RapidMiner are the most popular tools, with StatSoft STATISTICA becoming the most popular commercial tool, getting more votes than SAS (in part due to a more active campaign among StatSoft users, and the lack of such a campaign from SAS).
Among those who wrote analytics code in lower-level languages, R, SQL, Java, and Python were most popular.
This poll also had a very large number of participants and used email verification and other measures to remove unnatural votes (*see note below).
What Analytics, Data Mining, or Big Data software have you used in the past 12 months for a real project (not just evaluation)? [798 voters]
Legend: Free/Open Source tools | Commercial tools
Tool (votes) | % users in 2012 | % users in 2011
R (245) | 30.7% | 23.3%
Excel (238) | 29.8% | 21.8%
Rapid-I RapidMiner (213) | 26.7% | 27.7%
KNIME (174) | 21.8% | 12.1%
Weka / Pentaho (118) | 14.8% | 11.8%
StatSoft Statistica (112) | 14.0% | 8.5%
SAS (101) | 12.7% | 13.6%
Rapid-I RapidAnalytics (83) | 10.4% | not asked in 2011
MATLAB (80) | 10.0% | 7.2%
IBM SPSS Statistics (62) | 7.8% | 7.2%
IBM SPSS Modeler (54) | 6.8% | 8.3%
SAS Enterprise Miner (46) | 5.8% | 7.1%
Orange (42) | 5.3% | 1.3%
Microsoft SQL Server (40) | 5.0% | 4.9%
Other free analytics/data mining software (39) | 4.9% | 4.1%
TIBCO Spotfire / S+ / Miner (37) | 4.6% | 1.7%
Oracle Data Miner (35) | 4.4% | 0.7%
Tableau (35) | 4.4% | 2.6%
JMP (32) | 4.0% | 5.7%
Other commercial analytics/data mining software (32) | 4.0% | 3.2%
Mathematica (23) | 2.9% | 1.6%
Miner3D (19) | 2.4% | 1.3%
IBM Cognos (16) | 2.0% | not asked in 2011
Stata (15) | 1.9% | 0.8%
Bayesia (14) | 1.8% | 0.8%
KXEN (14) | 1.8% | 1.4%
Zementis (14) | 1.8% | 3.7%
C4.5/C5.0/See5 (13) | 1.6% | 1.9%
Revolution Computing (11) | 1.4% | 1.4%
Salford SPM/CART/MARS/TreeNet/RF (9) | 1.1% | 10.6%
Angoss (7) | 0.9% | 0.8%
SAP (including BusinessObjects/Sybase/Hana) (7) | 0.9% | not asked in 2011
XLSTAT (7) | 0.9% | 0.9%
RapidInsight/Veera (5) | 0.6% | not asked in 2011
11 Ants Analytics (4) | 0.5% | 5.6%
Teradata Miner (4) | 0.5% | not asked in 2011
Predixion Software (3) | 0.4% | 0.5%
WordStat (3) | 0.4% | 0.5%
Among tools with at least 10 users, the tools with the highest increase in “usage percent” were:
- Oracle Data Miner, 4.4% in 2012, up from 0.7% in 2011, a 505% increase
- Orange, 5.3% from 1.3%, 315% increase
- TIBCO Spotfire / S+ / Miner, 4.6% from 1.7%, 169% increase
- Stata, 1.9% from 0.8%, 130% increase
- Bayesia, 1.8% from 0.8%, 115% increase
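These increases can be approximately reproduced from the rounded table values; the published figures differ slightly because they were computed from unrounded vote shares:

```python
def pct_increase(new, old):
    """Percent increase in usage share from one year to the next."""
    return (new / old - 1.0) * 100.0

# Using the rounded table values (published: 315% and 130%, respectively).
print(round(pct_increase(5.3, 1.3)))  # Orange
print(round(pct_increase(1.9, 0.8)))  # Stata
```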
The three tools with highest decrease in usage percent were 11 Ants Analytics, Salford SPM/CART/MARS/TreeNet/RF, and Zementis. Their dramatic decrease is probably due to vendors doing much less (or nothing) to encourage their users to vote in 2012 as compared to 2011.
Note: 3 tools received fewer than 3 votes each and were not included in the table above: Clarabridge, Megaputer Polyanalyst/TextAnalyst, and Grapheur/LIONsolver.
Big Data
Big data tools use grew 5-fold, from about 3% to about 15% of respondents.
Big Data software you used in the past 12 months | % users in 2012
Apache Hadoop/Hbase/Pig/Hive (67) | 8.4% |
Amazon Web Services (AWS) (36) | 4.5% |
NoSQL databases (33) | 4.1% |
Other Big Data Data/Cloud analytics software (21) | 2.6% |
Other Hadoop-based tools (10) | 1.3% |
We also asked about the popularity of individual languages for data mining. Note that R is included both in this table and among the higher-level tools above.
Your own code you used for analytics/data mining in the past 12 months, in: | % users in 2012
R (245) | 30.7% |
SQL (185) | 23.2% |
Java (138) | 17.3% |
Python (119) | 14.9% |
C/C++ (66) | 8.3% |
Other languages (57) | 7.1% |
Perl (37) | 4.6% |
Awk/Gawk/Shell (31) | 3.9% |
F# (5) | 0.6% |
For comparison here are the recent software polls:
- KDnuggets 2011 Poll: Data Mining/Analytic Tools Used
- KDnuggets 2010 Poll: Data Mining / Analytic Tools Used
- KDnuggets 2009 Poll: Data Mining Tools Used.
Vote cleaning: To reduce multiple voting, this poll used email verification, which reduced the total number of votes compared to 2011 but made the results more representative.
Furthermore, some vendors were much more active than others in recruiting their users; to give a more objective picture of tool popularity, a large number (over 100) of “unnatural” votes were removed, leaving 798 votes.