Monthly Archives: October 2013
Reading articles and news stories about the data mining that companies and governments conduct on routine correspondence, telephone calls, purchases, and habits such as web searches and shopping preferences is certainly enough to make one paranoid.
Certainly I can understand people being upset upon learning how government agencies spy on routine activities. In fact, I cannot help but think of the Foundation series by Isaac Asimov. Perhaps I am the only one who sees a similarity between the power of mining “big data” and the branch of mathematics from the Foundation books called “psychohistory.”
In the Foundation series (especially the original trilogy: Foundation, Foundation and Empire, and Second Foundation) mathematician Hari Seldon used the laws of mass action to predict the future on a large scale. His predictions worked on the principle that the behavior of a mass of people is predictable if the number of people is very large (like the population of the universe).
Of course this is not the same thing as Target predicting which customers are expecting babies so it can send them special promotional offers at a key “flux” point in their purchasing habits, but it echoes many of the same concepts.
Perhaps I need to dig out my old copies of the Foundation trilogy for a good read over the weekend.
Interactions between drugs are a growing problem. As the population ages and the number of prescribed drugs (plus over-the-counter treatments and supplements) increases, it becomes more and more likely that interactions between two or more drugs will result in serious side effects. Of course the FDA keeps records of reported interactions, and we all hope that our pharmacists are keeping an eye out for known ones. But recent studies have shown that there is another source for finding hidden drug interactions: web searches.
Researchers from Stanford University in California mined the FDA’s database of adverse drug effects and found that two commonly used drugs – the antidepressant paroxetine and pravastatin, used to lower cholesterol – put patients at risk of developing diabetes when used in combination.
The research into refining how to conduct such an analysis, along with the analysis of the FDA’s adverse drug effects database itself, resulted in 47 previously unreported drug-drug interactions being detected and reported.
This is great, but of course many adverse effects may be noticed by patients yet never mentioned to a doctor, let alone reported to the FDA. That is where the new research on web searches comes in. The researchers from Stanford analyzed web search logs from 2010 (before the paroxetine-pravastatin interaction was discovered) and found a clear spike in searches combining diabetes symptoms with these drug names.
It is interesting and useful that researchers found a combination of drugs and symptoms that, when searched together, indicated a drug-drug interaction. But what I find most interesting is that the researchers were able to mine the web search logs and detect the interaction even when the searches for the drug names and symptoms were made separately, even days or weeks apart.
This type of data mining could be an important part of the arsenal to detect drug-drug interactions earlier and before more people suffer from the interactions.
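As a purely hypothetical sketch of the underlying idea (the user IDs, query terms, and counting scheme below are invented for illustration and are not the Stanford researchers’ actual method), one could pool each user’s queries over time and compare symptom-search rates between users who searched for both drugs and users who searched for only one:

```python
def interaction_signal(search_logs, drug_a, drug_b, symptom_terms):
    """Compare how often symptom searches occur among users who searched
    for BOTH drugs versus users who searched for only one of them.

    search_logs: dict mapping user_id -> set of lowercase query terms,
    pooled per user over time (mirroring the observation that drug and
    symptom searches may happen days or weeks apart)."""
    both, either = [], []
    for terms in search_logs.values():
        has_a, has_b = drug_a in terms, drug_b in terms
        has_symptom = any(s in terms for s in symptom_terms)
        if has_a and has_b:
            both.append(has_symptom)
        elif has_a or has_b:
            either.append(has_symptom)
    rate = lambda flags: sum(flags) / len(flags) if flags else 0.0
    return rate(both), rate(either)

# invented example data
logs = {
    "u1": {"paroxetine", "pravastatin", "blurred vision"},
    "u2": {"paroxetine", "pravastatin", "frequent urination"},
    "u3": {"paroxetine", "headache"},
    "u4": {"pravastatin", "fatigue"},
    "u5": {"paroxetine", "pravastatin", "weather"},
}
symptoms = {"blurred vision", "frequent urination", "excessive thirst"}
both_rate, either_rate = interaction_signal(logs, "paroxetine", "pravastatin", symptoms)
# both_rate = 2/3 (two of three combo users searched a diabetes symptom)
# either_rate = 0.0
```

A markedly higher symptom rate in the “both drugs” group is the kind of disproportionality signal that would flag the pair for follow-up.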
The best marketing departments have always maintained their primary focus on the desires of the customer. Now, Customer Relationship Management technology has made it increasingly feasible to find out exactly what customers want, when they want it, how they want it delivered, and why they want it.
Targeted marketing has never been more possible, or more vital, than in today’s unpredictable marketplace. Savvy marketing departments use STATISTICA Data Miner to mine for:
- Customer profiles
- Opportunities for cross-selling and up-selling
- Effective marketing campaign strategies
- Optimal inventory and packaging
STATISTICA Data Miner Tools can help you find:
- Customers who respond to new products
- Customers who respond to discounts
- Customers who buy in specific product categories
- Your most loyal customers
- Geographic or other differences
- Each marketing campaign’s ROI
- And much more
- Enterprise-wide solution: A multi-user, role-based, secure STATISTICA Enterprise platform allows for a truly collaborative environment in which to build, test, and deploy the best possible models.
- Enhanced Text Analytics: STATISTICA provides an advanced text miner tool to better leverage unstructured/textual data.
- Cutting-edge Predictive Analytics: STATISTICA provides a wide variety of basic to sophisticated algorithms to build models which provide the most lift and highest accuracy for customer insight.
- Innovative Data Pre-processing Tools: STATISTICA provides a very comprehensive list of data management and data visualization tools.
- Powerful Statistical Tools: STATISTICA provides you with an arsenal of the most powerful statistical tools available.
- Reflexive models for real–time needs: Use Live Score® to process new issues as they occur, and update your models in turn-around times made possible only by STATISTICA’s integrated solutions.
- Integrated Workflow: STATISTICA Decisioning platform provides a streamlined workflow for powerful, rules-based, predictive analytics where business rules and industry regulations are used in conjunction with advanced analytics to build the best models.
Market basket analysis is the study of items that are purchased (or otherwise grouped) together in a single transaction or multiple, sequential transactions. Understanding the relationships and the strength of those relationships is valuable information that can be used to make recommendations, cross-sell, up-sell, offer coupons, etc. The analysis reveals patterns such as that of the well-known study which found an association between purchases of diapers and beer.
Transactional databases can be quite large and cumbersome. Extracting valuable information requires the right toolset. STATISTICA provides advanced analytic tools to explore market baskets and find actionable information therein. The rules uncovered with Sequence Association and Link Analysis can be deployed to new databases to show the probability of purchasing item x, given item y is in the shopping cart.
In this plot, items that occur together in transactions are shown. The strength of the relationships is conveyed in the thickness of the line. The frequency of the item is conveyed in the size of the circle that represents a product.
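Such rules are conventionally scored by support (how often two items appear in the same basket) and confidence (how often the consequent appears given the antecedent). Here is a minimal illustrative sketch of pairwise rule extraction — the general technique, not STATISTICA’s implementation, with invented basket data:

```python
from itertools import combinations
from collections import Counter

def association_rules(transactions, min_support=0.3, min_confidence=0.6):
    """Derive pairwise rules (x -> y) scored by support and confidence.
    support(x, y) = P(x and y in the same basket); confidence(x -> y) = P(y | x)."""
    n = len(transactions)
    item_counts = Counter()
    pair_counts = Counter()
    for basket in transactions:
        items = set(basket)
        item_counts.update(items)
        pair_counts.update(frozenset(p) for p in combinations(sorted(items), 2))
    rules = []
    for pair, count in pair_counts.items():
        support = count / n
        if support < min_support:
            continue
        x, y = tuple(pair)
        for a, b in ((x, y), (y, x)):      # consider both rule directions
            confidence = count / item_counts[a]
            if confidence >= min_confidence:
                rules.append((a, b, support, confidence))
    return rules

# invented transactions echoing the famous diapers-and-beer association
baskets = [
    {"diapers", "beer", "chips"},
    {"diapers", "beer"},
    {"diapers", "wipes"},
    {"bread", "beer"},
]
for x, y, s, c in association_rules(baskets):
    print(f"{x} -> {y}: support={s:.2f}, confidence={c:.2f}")
```

Rules that clear both thresholds are exactly the “probability of purchasing item x, given item y is in the shopping cart” statements described above.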
- Full Range of Solutions: Preparing data for analysis, feature selection, model selection, model evaluation and deployment. STATISTICA incorporates all these into one software package.
- The most powerful algorithms available: STATISTICA incorporates not only traditional data mining algorithms such as CART and CHAID but also other powerful algorithms like Boosted Trees, Random Forests, and Neural Networks for sophisticated propensity models.
- Sequence and Association analysis: Find and quantify associations between purchases and behaviors with STATISTICA’s Sequence Association and Link Analysis tool. What items are frequently purchased together, subsequently, etc. Uncover these valuable patterns to improve marketing strategies.
- Enterprise-Wide Convenience: Build models in one department, test models in another department, and then start scoring in offices worldwide. STATISTICA Enterprise™ is a truly collaborative tool that leverages the full power of the best market basket analysis tools available.
- Reflexive models for real–time needs: Use Live Score® to process new customers as they come in, and update your market basket models in turn-around times made possible only by STATISTICA’s integrated solutions.
Propensity and Best-Next-Action Modeling
More companies are investing time and money in predictive analytics in order to understand their customers’ behaviors in new ways. By analyzing cross-referenced customer profiles and purchase histories, these companies can predict the likelihood, or propensity, of future activity at a customer-specific level and then proceed with best next actions as necessary. For instance, anticipating future purchases enables a company to recommend (cross-sell/up-sell) suitable products to customers.
Anticipating customer loss to competitors (i.e., churn), on the other hand, enables a company to intervene in hopes of increasing retention rates. All these profiles, purchase histories, best-next-action recommendations, and resolutions require large and potentially cumbersome transactional databases, so extracting valuable information requires the right toolset.
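As a rough illustration of what a propensity model does — a deliberately simplified sketch, not STATISTICA’s tooling, with invented features and data — a logistic regression can turn a customer profile into a probability of future action:

```python
import math

def train_propensity(X, y, lr=0.1, epochs=2000):
    """Fit a logistic-regression propensity model with plain gradient descent.
    X: list of feature rows; y: 1 if the customer took the action, else 0."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for row, target in zip(X, y):
            z = b + sum(wi * xi for wi, xi in zip(w, row))
            p = 1.0 / (1.0 + math.exp(-z))     # predicted propensity
            err = p - target                    # gradient of log-loss
            w = [wi - lr * err * xi for wi, xi in zip(w, row)]
            b -= lr * err
    return w, b

def propensity(w, b, row):
    z = b + sum(wi * xi for wi, xi in zip(w, row))
    return 1.0 / (1.0 + math.exp(-z))

# invented features: [recency (months since last purchase), frequency (orders/yr)]
X = [[1, 12], [2, 8], [6, 2], [8, 1], [1, 10], [7, 1]]
y = [1, 1, 0, 0, 1, 0]          # 1 = responded to the last campaign
w, b = train_propensity(X, y)
score = propensity(w, b, [2, 9])  # a recent, frequent buyer scores high
```

Scoring every customer this way yields the ranked list that drives a best-next-action decision (who gets the cross-sell offer, who gets the retention call).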
- Full Range of Solutions: Preparing data for analysis, feature selection, model selection, model evaluation and deployment. STATISTICA incorporates all these into one software package.
- Highly Sophisticated Tools: Build Propensity Models or Customer Behavior Scoring Models to predict future behavior of customers. Highly sophisticated and robust tools are also available for performing Recency, Frequency and Monetary (RFM) Analysis of previous purchases to understand customer behavior and define market segments.
- Sequence and Association Analysis: Find and quantify associations between purchases and behaviors (e.g., what items are frequently purchased together, subsequently, etc.) with STATISTICA’s Sequence Association and Link Analysis tool. Uncover these valuable patterns to improve marketing strategies.
- Enterprise-Wide Convenience: Build models in one department, test models in another department, and then start scoring in offices worldwide. STATISTICA Enterprise™ is a truly collaborative tool that leverages the full power of the best propensity models available.
- Reflexive Models for Real–Time Needs: Live Score® processes new transactions as they happen and updates propensity models in rapid turn-around times made possible only by STATISTICA’s integrated solutions.
Highlights from the 2013 Data Miner Survey were released by Rexer Analytics at the 2013 Predictive Analytics World conference on September 30. Complete results will be released later this year.
For the fourth time in a row, StatSoft’s flagship platform, STATISTICA, received both the highest rating in overall user satisfaction as well as the highest selection as “primary” data mining tool overall from among commercial data mining packages.
Dr. Karl Rexer presented his summary findings with charts showing that 95% of STATISTICA users reported being “satisfied” or “extremely satisfied” with STATISTICA. Of these respondents, about two-thirds gave STATISTICA the “extremely satisfied” ranking, a ratio considerably greater than that earned even by the second-place tool. In fact, STATISTICA’s “extremely satisfied” ratio was more than double that of most commercial tools in the survey.
Significantly, no data miners recorded any level of “dissatisfaction” or “extreme dissatisfaction” with the STATISTICA tool.
STATISTICA’s selection as the primary commercial data mining tool overall was determined by respondents with access to multiple data mining tools who most often selected STATISTICA as their tool of choice. Behind this top ranking were some strong showings within various user subgroups. STATISTICA was identified by the highest percentage of corporate data miners as their primary data mining tool, selected over any other commercial or GPL-licensed package. It was also identified as the primary commercial tool among academic data miners. Among non-profit, NGO, and government data miners, STATISTICA was identified as a primary commercial tool in a three-way tie with SAS and IBM SPSS.
Conducted regularly since 2007, the Rexer Analytics Survey is the largest of its kind, surveying data miners from 75 countries during the spring of 2013. The survey’s questions covered analytic techniques and tools used in data mining practice, types of data analyzed, and challenges encountered.
A full report of the survey will be made available from Rexer Analytics.
The BRICS-CCI & CBIC 2013 Congress in Brazil recently published results of its first data mining algorithm competition, announcing that StatSoft’s solution achieved first-place and second-place wins for STATISTICA Data Miner.
The international Computational Intelligence Algorithm Competition (CIAC), co-organized by NeuroTech S.A., focused for the first time on data mining applied to credit scoring. Participants were required to address the effects of temporal degradation of performance and seasonality of payment delinquency. Operating under the “Team Sandvika” name, StatSoft Norway’s Country Manager Knut Opdal and Rikard Bohm submitted a solution that demonstrated the superiority of modern data mining methods over the more traditional logistic regression method.
Their STATISTICA model placed first when judged for fit of estimated vs. actual delinquency of approved credit applications, and it placed second when judged for robustness against performance degradation over a multi-year data set.
“We used this competition as an opportunity to compare the performance of different classes of predictive models,” Opdal and Bohm stated. “In particular we wanted to compare the industry standard method (logistic regression) with Boosting Trees, Random Forests, MARSplines and neural networks.”
Open to participants from academia and industry, the competition consisted of two tasks. Task 1, which focused on performance degradation, was evaluated using the usual area under the Receiver Operating Characteristic (ROC) curve metric. Task 2 focused on the fit of estimated vs. actual delinquency of those credit applications approved by the Task 1 model. CIAC organizers claimed this second task represents a realistic innovation in data mining competitions worldwide by emphasizing the relevance of the quality of future delinquency estimation instead of the usual lowest future average delinquency.
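For reference, the area under the ROC curve used in Task 1 has a simple rank-based interpretation: it is the probability that a randomly chosen positive case (here, a delinquent applicant) receives a higher score than a randomly chosen negative one. A minimal sketch with invented scores:

```python
def roc_auc(scores, labels):
    """Area under the ROC curve via the rank (Mann-Whitney) formulation:
    the fraction of positive/negative pairs in which the positive case
    receives the higher score (ties count as half a win)."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# invented data: scores = estimated delinquency risk, labels = actual delinquency
auc = roc_auc([0.9, 0.8, 0.7, 0.4, 0.3, 0.2], [1, 1, 0, 1, 0, 0])
# → 8/9 ≈ 0.889; 1.0 would be a perfect ranking, 0.5 no better than chance
```

This is why AUC is a natural metric for scorecards: it measures ranking quality independently of any particular approval cutoff.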
As a top performer in these two tasks, StatSoft Norway was invited to present its winning solution in a special CIAC track scheduled during the three-day BRICS-CCI & CBIC Congress in September. Opdal arranged via Skype to deliver the white paper, “Benchmarking of Different Classes of Models Used for Credit Scoring,” in which he described the methodology and techniques he and Bohm applied.
In the presentation Opdal and Bohm conclude that, as volume of data and/or variables increases, score model performance using modern data mining techniques (e.g., Boosting Trees, MARSplines, Neural Network) is “significantly better than (with) traditional scorecards.”
Competition results are listed here.
Watch Knut’s oral presentation of STATISTICA’s winning solution here.
About BRICS-CCI & CBIC
BRICS is an acronym for the economic group of Brazil, Russia, India, China, and South Africa, which held its first Congress on Computational Intelligence (BRICS-CCI 2013) during September, alongside the 11th bi-annual Brazilian Congress on Computational Intelligence (CBIC). The Congress’ website states that the objective of BRICS-CCI 2013 is to provide a high-level international forum for scientists, researchers, engineers, and educators to disseminate their latest research results and exchange views on future research directions.
What is the relationship between Cognitive Mining and Data Mining? I discuss this question, along with what makes StatSoft different, achieving user satisfaction, and Big Data and privacy, with StatSoft VP Dr. Thomas Hill.
By Gregory Piatetsky, Oct 14, 2013.
What is the relationship between Cognitive Mining and Data Mining?
According to Wikipedia, Lewicki and Hill showed that the advanced expertise humans acquire via experience involves the acquisition and use of patterns that can be more complex than what humans can verbalize or intuitively experience. Frequently such patterns involve high-order interactions between multiple variables, while human consciousness usually can handle only first- and second-order interactions.
Dr. Thomas Hill is VP of Analytic Solutions at StatSoft Inc., where he has worked for over 20 years on the development of data analysis and data and text mining algorithms and on the delivery of analytic solutions. He was a professor at the U. of Tulsa from 1984 to 2009, where he taught data analysis and data mining courses. Dr. Hill has received numerous academic grants and awards from the NSF, the NIH, the Center for Innovation Management, the Electric Power Research Institute, and other institutions.
Here is my interview with Dr. Hill.
Gregory Piatetsky, Q1: Your landmark research with Paul Lewicki [and Maria Czyzewska] on “Nonconscious social information processing” showed that humans can acquire complex advanced expertise that they cannot verbalize. This suggests a limitation of expert-hypothesis-driven data analysis methods, because they rely on testing hypotheses that have to be explicitly formulated by researchers.
What are the broad implications for data mining and data science?
Thomas Hill: Lewicki and others (including some research published by Thomas Hill) have demonstrated over a wide range of human experiences and expertise, that exposure to complex and rich stimuli, consisting of large numbers of sensory inputs and high-order interactions between the presence or absence of specific features, will stimulate the acquisition of complex procedural knowledge without the learners’ conscious awareness. Hence the acquisition of such knowledge is best characterized as non-conscious information acquisition and processing.
For example, when humans look at sequences of abstract pictures, faces, or tracking targets over seemingly random locations on the screen, carefully calibrated measures of procedural knowledge (e.g., based on response times) will reflect the acquisition of knowledge about complex covariations and rules inferred from the rich and complex stimuli.
The conclusions from this research are highly relevant for understanding how large amounts of high-dimensional information, consisting of complex interactions between numerous parameters, can be derived efficiently through systematic exposure to relevant stimuli and exemplars. Specifically:
- It appears that knowledge about complex interactions and relationships in rich stimuli is the result of the repeated application of simple covariation-learning algorithms that detect co-occurrences between certain stimuli and combine them into complex interactions and knowledge
- In human experts, most of this knowledge is procedural in nature, not declarative; in short, experienced experts can be effective and efficient decision makers but are poor at verbalizing how those decisions were made
- When the covariations and repeated patterns in the rich stimulus field change, so that previously acquired procedural knowledge is no longer applicable, experts are slow to recognize this, and are often confused and reluctant to let go of “old habits”
Human expertise and effective decision making can be remarkable in many ways:
- It is capable of leveraging “big data,” i.e., it is remarkable with respect to the amount of information and stored knowledge that is used.
- It is capable of coping with high-velocity data, i.e., it is very fast with respect to the speed with which information is synthesized into effective, accurate decisions.
- It is very efficient with respect to how little energy our brain requires to process vast amounts of information and make near-instant decisions.
From the perspective of analytic approaches, these capabilities are accomplished through the repeated application of simple learning algorithms to rich and complex stimuli to identify repeated patterns that allow for accurate expectations and predictions regarding future events and outcomes.
It seems that big-data analytics is converging on this approach as well: applying large numbers of models, based on general approximators applied to relevant, diverse exemplars, is in most cases the best recipe for extracting complex information from data.
GP, Q2: Your findings reminded me of Leo Breiman’s famous 2001 paper “Statistical Modeling: The Two Cultures” (Statistical Science 16:3), where he writes:
“There are two cultures in the use of statistical modeling to reach conclusions from data. One assumes that the data are generated by a given stochastic data model. The other uses algorithmic models and treats the data mechanism as unknown.”
Leo Breiman put himself in the second culture, which he described (in 2001) as a small minority of researchers. Do your findings support the second, algorithmic, data-driven culture of data analysis, and if so, how?
TH: See also the response to Q1. Obviously, there are and always will be applications for statistical hypothesis testing and modeling. In particular in science, it remains critical that evidence for theories and theoretical understanding of reality be advanced by testing a-priori hypotheses derived from theories, or by refining a-priori expectations.
There are also applications where this approach is critical: Recall that human “experts” (with highly evolved procedural knowledge in some domain) are usually not good at responding and understanding when old rules no longer apply.
If the mechanisms generating the data are not understood (e.g., why a drug is effective), it can easily happen that something changes that renders old findings no longer predictive of future outcomes. In medicine, such errors can be critical.
GP, Q3: How did your research in cognitive psychology influence STATISTICA?
TH: Most importantly, it has driven the roadmap with respect to which algorithms we embraced and refined. For example, boosting of simple learners against repeated samples of diverse exemplars (e.g., stochastic gradient boosting) is one of the algorithms that, in our minds, “mimics” in many ways how humans acquire procedural learning.
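The core of that boosting idea can be sketched in a few lines. The following is an illustrative toy implementation of least-squares gradient boosting with one-dimensional threshold “stumps” as the simple learners — the general technique being described, not STATISTICA’s implementation:

```python
def best_stump(xs, residuals):
    """Find the 1-D threshold split that minimizes squared error on the residuals."""
    best = None
    for t in sorted(set(xs)):
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        if not right:                      # t is the max: no real split
            continue
        lm = sum(left) / len(left)
        rm = sum(right) / len(right)
        err = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or err < best[0]:
            best = (err, t, lm, rm)
    _, t, lm, rm = best
    return lambda x: lm if x <= t else rm

def boost(xs, ys, n_rounds=200, lr=0.3):
    """Least-squares gradient boosting: each stump fits the residuals left by
    the ensemble so far, so complex structure emerges from very simple learners."""
    pred = [0.0] * len(xs)
    stumps = []
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, pred)]
        stump = best_stump(xs, residuals)
        stumps.append(stump)
        pred = [p + lr * stump(x) for p, x in zip(pred, xs)]
    return lambda x: sum(lr * s(x) for s in stumps)

xs = [0, 1, 2, 3, 4, 5, 6, 7]
ys = [0, 0, 1, 1, 0, 0, 1, 1]   # a pattern no single threshold can capture
model = boost(xs, ys)            # yet the boosted ensemble recovers it
```

No individual stump “knows” the alternating pattern; it emerges from repeatedly applying a trivially simple learner to what the ensemble has not yet explained — the analogy to covariation learning made in the answers above.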
GP, Q4: StatSoft took first place in user satisfaction in the 2013 Rexer Analytics Survey (followed by KNIME, SAS JMP, IBM SPSS Modeler, and RapidMiner) and had high satisfaction in other user surveys. Who are your typical users, and how do you achieve such satisfaction?
TH: We have always maintained a very disciplined approach to logging and “digesting” customer feedback. As a result, in many ways we may well have the simplest point-and-click interfaces for building even complex models.
The other big factor, in our experience, is the fact that STATISTICA is very easy to integrate into existing IT assets, regardless of whether they depend on Excel or text files, or on distributed file systems, web-based data services, and schedulers. One way to look at our platform is as a development platform that is highly compliant with standard interfaces, programming and scripting languages, and so on. We know for sure that this makes deployments of our platform at our larger Enterprise clients much easier and more cost effective: in many ways, STATISTICA will simply be just another (Windows) service running against existing standard database tables that store all data and metadata. So no new IT skills are required.
In practice, projects can fail when a platform does not integrate, or does not integrate easily, with what is already there, or fails to enable practitioners and non-data-scientists to do useful work quickly. STATISTICA is very good at that.
GP, Q5: How would you compare StatSoft STATISTICA Data Miner with other similar products? How do you compete with enterprise products like SAS and IBM SPSS Modeler on one hand, and free, open source software like R, KNIME, or RapidMiner on the other? What are some new exciting features you are planning to add?
TH: Regarding our competitive advantages over products from SAS and IBM, they are, of course, tough competitors, and we understand that we will win customers only if we outperform our competitors in the areas that are most relevant to the users.
Needless to say, we are working hard to achieve that goal and in the last two years have made significant progress as indicated by market share. Where exactly are our specific strengths in relation to products from these two competitors? I would prefer users (who are the most impartial judges) to answer these questions for you…
Regarding R and other open source software: we certainly do NOT consider them to be our competitors but, rather, most welcome allies who help proliferate the use of advanced analytics in addition to making significant contributions to the science that we all rely on.
StatSoft was one of the first commercial software companies to fully embrace (in the sense of supporting) R, by incorporating a seamless integration between R and our platform. Also, to the best of our knowledge, we are the only one among the major data mining companies that has contributed to R by enhancing its functionality (i.e., by releasing functionality to the R community under unrestricted GPL licensing).
On the other hand, StatSoft’s customers depend on us for our analytics systems, platforms, and solutions that are validated, meticulously tested, follow carefully controlled software life-cycle management procedures, and are developed in close collaboration with end-users in the respective industries to meet their detailed requirements. The open-source world has been and continues to be a wonderful “Wikipedia of statistics and analytics” – a dynamic forum of ideas, new algorithms, methods, technologies.
Commercial, mission-critical applications require stringent software development procedures, software lifecycle management, validation, test cases, requirements documents, and so on. For example, in medical device and pharmaceutical manufacturing, analytics have to be validated, documented, and then “locked down.” This means features such as version control of analytic recipes, audit logs, and approval processes are all critical.
In our opinion, open-source code will continue to grow and provide important new ideas. At the same time, commercial and/or mission critical applications will also continue to rely on STATISTICA for its functionality that continues to be developed in direct response to real-life use cases and to the endless lists of requirements that are dictated by constant interactions with the customers who use our software for mission critical applications.
Also, unlike open source software, which delivers immensely valuable ideas and implementations but is less disciplined in its product lifecycle management, the STATISTICA software is strictly validated in a highly disciplined environment, following a product life cycle management process that adheres to SOPs and, for example, maintains backwards compatibility with previous versions. So you will never encounter a situation in which some “new and improved” version of STATISTICA breaks the previous implementation of our technology at customer sites. Also, the STATISTICA software is entirely free of the restrictions that some open-source tools and algorithms place on commercial use (restrictions we respect and honor).
Regarding the roadmap and “exciting new features”: without giving away the “punch line,” suffice it to say that one of the opportunities of big data is to build, manage, and maintain large numbers of models. Again, this is something we have seen for a while in manufacturing (thousands of parameters recorded second-by-second to describe very complex processes). This means that a challenge for big-data analytics is to automate model building itself and to enable effective model sentinels that know when to recalibrate models, and do so automatically. In short, the challenge is to enable fewer data analysts and scientists to manage more models (perhaps thousands per analyst) and to take full advantage of data that are collected at ever-increasing speed and volume. That is where a lot of our R&D has been going for a while.
GP, Q6: You were a professor at the U. of Tulsa for 25 years. How did you combine research at the U. of Tulsa with work at StatSoft, and what eventually caused you to leave the university to work for StatSoft full-time?
TH: We never really combined the two. I left The University of Tulsa in the late nineties (and after ten years). That was an exciting time when many of the algorithms and approaches commonly applied today started to emerge. I wanted to play a role in this emerging technology, based on my understanding of at least some of the basic mechanisms responsible for the incredible data processing capabilities of the human mind.
GP: Here is the second part of the interview with Dr. Thomas Hill on Cognitive Mining, Data Mining, and StatSoft.