Monthly Archives: October 2013

Data Mining…does it have similarities to “psychohistory”?

By Win Noren

Reading articles and news stories about the data mining that companies and governments conduct on routine correspondence, telephone calls, purchases, and all manner of habits, from web searches to shopping and personal preferences, is certainly enough to make one paranoid.

Certainly I can understand people being upset upon learning how government agencies are spying on routine activities. In fact, I cannot help but think about the Foundation series by Isaac Asimov. Perhaps I am the only one who sees a similarity between the power of mining “big data” and the branch of mathematics from the Foundation books called “psychohistory.”

In the Foundation series (especially the original trilogy: Foundation, Foundation and Empire, and Second Foundation), mathematician Hari Seldon used the laws of mass action to predict the future on a large scale. His predictions worked on the principle that the behavior of a mass of people is predictable if the number of people is very large (like the population of the universe).
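
As a quick aside, the statistical principle behind that idea is easy to demonstrate. Here is a tiny Python sketch (my own illustration, nothing to do with Asimov or any particular data mining product): individual choices are unpredictable, but the aggregate behavior of a large crowd settles down to something very predictable.

    import numpy as np

    rng = np.random.default_rng(0)

    # Each person independently "chooses" option A with probability 0.6.
    # Individually unpredictable; in aggregate, the share choosing A
    # converges on 0.6 as the crowd grows (the law of large numbers).
    for n in [10, 1_000, 100_000, 10_000_000]:
        choices = rng.random(n) < 0.6
        print(f"population {n:>10,}: share choosing A = {choices.mean():.4f}")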

Of course, this is not the same thing as Target predicting which customers are expecting babies so they can be sent special promotional offers at a key “flux” point in their purchasing habits, but it does echo many of the same concepts.

Perhaps I need to dig out my old copies of the Foundation trilogy for a good read over the weekend.


Detecting Interactions Between Drugs

By Win Noren

Interactions between drugs are a growing problem. As the population ages and the number of prescribed drugs (plus over-the-counter treatments and supplements) increases, it becomes more and more likely that an interaction between two or more drugs could result in serious side effects. Of course, the FDA keeps records of reported interactions, and we all hope that our pharmacist is keeping an eye out for known interactions. But recent studies have shown that there is another source for finding hidden drug interactions: web searches.

 

 

Researchers from Stanford University in California mined the FDA’s database of adverse drug effects and found that two commonly used drugs – the antidepressant paroxetine and pravastatin, used to lower cholesterol – put patients at risk of developing diabetes when used in combination.

The research into refining how to conduct such an analysis, along with the work of analyzing the FDA’s adverse drug effects database, resulted in 47 previously unreported drug-drug interactions being detected and reported.

This is great, but of course many adverse effects may be noticed by patients and never mentioned to a doctor, let alone reported to the FDA. That is where the new research on web searches comes in. The researchers from Stanford analyzed web search logs from 2010 (before the paroxetine-pravastatin interaction was discovered) and found a clear spike in searches for diabetes symptoms together with these drug names.

It is interesting and useful that researchers found a combination of drugs and symptoms that when searched together indicated a drug-drug interaction. But what I find most interesting is that the researchers were able to mine the web search logs and detect the drug-drug interaction even when the searches of the drug names and symptoms were made separately, even days or weeks apart.
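
To make the idea concrete, here is a minimal Python sketch of the kind of signal involved. The data and the measure are my own simplifications (a toy search log and a simple rate comparison), not the Stanford team's actual method or data:

    import pandas as pd

    # Hypothetical search-log data: one row per (user, query term).
    log = pd.DataFrame({
        "user": ["u1", "u1", "u1", "u2", "u2", "u3", "u3", "u3", "u4"],
        "term": ["paroxetine", "pravastatin", "blurry vision",
                 "paroxetine", "fatigue",
                 "pravastatin", "paroxetine", "frequent urination",
                 "pravastatin"],
    })

    symptoms = {"blurry vision", "fatigue", "frequent urination"}
    terms_by_user = log.groupby("user")["term"].agg(set)

    # Users who searched both drug names vs. users who searched only one
    both = terms_by_user[terms_by_user.apply(lambda t: {"paroxetine", "pravastatin"} <= t)]
    one = terms_by_user[terms_by_user.apply(
        lambda t: len({"paroxetine", "pravastatin"} & t) == 1)]

    def symptom_rate(groups):
        """Share of users in the group whose searches include any symptom term."""
        return groups.apply(lambda t: bool(t & symptoms)).mean()

    print("symptom search rate, both drugs:", symptom_rate(both))
    print("symptom search rate, one drug:  ", symptom_rate(one))
    # A markedly higher rate in the "both drugs" group is the kind of signal
    # that flags a possible interaction worth investigating.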

This type of data mining could be an important part of the arsenal for detecting drug-drug interactions earlier, before more people suffer from them.

Customer Insight

The best marketing departments have always maintained their primary focus on the desires of the customer. Now, Customer Relationship Management technology has made it increasingly feasible to find out exactly what the customer wants, when they want it, how they want it delivered, and why they want it.

Targeted marketing has never been more possible or more vital in today’s unpredictable marketplace.  Savvy marketing departments use STATISTICA Data Miner to mine for:

  • Customer profiles
  • Opportunities for cross-selling and up-selling
  • Effective marketing campaign strategies
  • Optimal inventory and packaging

STATISTICA Data Miner Tools can help you find:

  • Customers who respond to new products
  • Customers who respond to discounts
  • Customers who buy in specific product categories
  • Your most loyal customers
  • Geographic or other differences
  • Each marketing campaign’s ROI
  • And much more

STATISTICA Solution

  • Enterprise wide solution: A multi-user, role based, secure STATISTICA Enterprise platform allows for a truly collaborative environment to build, test and deploy the best possible models.
  • Enhanced Text Analytics: STATISTICA provides an advanced text miner tool to better leverage unstructured/textual data.
  • Cutting-edge Predictive Analytics: STATISTICA provides a wide variety of basic to sophisticated algorithms to build models which provide the most lift and highest accuracy for customer insight.
  • Innovative Data Pre-processing Tools: STATISTICA provides a very comprehensive list of data management and data visualization tools.
  • Powerful Statistical Tools: STATISTICA provides you with an arsenal of the most powerful statistical tools available.
  • Reflexive models for realtime needs: Use Live Score® to process new issues as they occur, and update your models in turn-around times made possible only by STATISTICA’s integrated solutions.
  • Integrated Workflow: STATISTICA Decisioning platform provides a streamlined workflow for powerful, rules-based, predictive analytics where business rules and industry regulations are used in conjunction with advanced analytics to build the best models.

Market Basket Analysis

Market basket analysis is the study of items that are purchased (or otherwise grouped) together in a single transaction or multiple, sequential transactions. Understanding the relationships and the strength of those relationships is valuable information that can be used to make recommendations, cross-sell, up-sell, offer coupons, etc. The analysis reveals patterns such as that of the well-known study which found an association between purchases of diapers and beer.

Transactional databases can be quite large and cumbersome. Extracting valuable information requires the right toolset. STATISTICA provides advanced analytic tools to explore market baskets and find actionable information therein. The rules uncovered with Sequence Association and Link Analysis can be deployed to new databases to show the probability of purchasing item x, given item y is in the shopping cart.
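
To make the idea concrete, here is a minimal Python sketch of such a rule, using made-up baskets rather than STATISTICA's own Sequence Association and Link Analysis implementation. Support, confidence, and lift for the rule "if diapers, then beer" quantify exactly the kind of relationship described above:

    from itertools import combinations
    from collections import Counter

    # Made-up transactions (shopping baskets)
    baskets = [
        {"diapers", "beer", "chips"},
        {"diapers", "beer"},
        {"diapers", "wipes"},
        {"beer", "chips"},
        {"diapers", "beer", "wipes"},
    ]
    n = len(baskets)

    # Count single items and item pairs across all baskets
    item_counts = Counter(item for b in baskets for item in b)
    pair_counts = Counter(frozenset(p) for b in baskets
                          for p in combinations(sorted(b), 2))

    def rule_stats(y, x):
        """Support, confidence, and lift for the rule 'if y is in the cart, then x'."""
        support = pair_counts[frozenset({x, y})] / n       # P(x and y)
        confidence = support / (item_counts[y] / n)        # P(x | y)
        lift = confidence / (item_counts[x] / n)           # vs. buying x at random
        return support, confidence, lift

    support, confidence, lift = rule_stats("diapers", "beer")
    print(f"diapers -> beer: support={support:.2f}, confidence={confidence:.2f}, lift={lift:.2f}")

A lift above 1 means the two items appear together more often than chance alone would suggest; those are the relationships represented by the thicker lines in the network plot shown below.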

STATISTICA screenshot association rules network

In this plot, items that occur together in transactions are shown. The strength of the relationships is conveyed in the thickness of the line. The frequency of the item is conveyed in the size of the circle that represents a product.

STATISTICA Solution

  • Full Range of Solutions: Preparing data for analysis, feature selection, model selection, model evaluation and deployment. STATISTICA incorporates all these into one software package.
  • The most powerful algorithms available: STATISTICA incorporates not only traditional data mining algorithms such as CART and CHAID, but also other powerful algorithms like Boosted Trees, Random Forests, and Neural Networks for sophisticated propensity models.
  • Sequence and Association analysis: Find and quantify associations between purchases and behaviors with STATISTICA’s Sequence Association and Link Analysis tool. What items are frequently purchased together, subsequently, etc. Uncover these valuable patterns to improve marketing strategies.
  • Enterprise-Wide Convenience: Build models in one department, test models in another department, and then start scoring in offices worldwide. STATISTICA Enterprise™ is a truly collaborative tool that leverages the full power of the best  market basket analysis tools available.
  • Reflexive models for realtime needs: Use Live Score® to process new customers as they come in, and update your market basket models in turn-around times made possible only by STATISTICA’s integrated solutions.

Propensity and Best-Next-Action Modeling

More companies have started investing time and money in predictive analytics in order to understand their customers’ behaviors in new ways. By analyzing cross-referenced customer profiles and purchase histories, these companies can predict the likelihood, or propensity, of future activity at a customer-specific level and then proceed with best next actions as necessary. For instance, anticipating future purchases enables a company to recommend suitable products to customers (cross-sell/up-sell).

Anticipating customer loss to competitors (i.e., churn), on the other hand, enables a company to intervene in hopes of increasing retention rates. All these profiles, purchase histories, best-next-action recommendations, and resolutions require large and potentially cumbersome transactional databases, so extracting valuable information requires the right toolset.
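
As a rough illustration of what a propensity score is, here is a minimal sketch in Python using scikit-learn and synthetic data, not the STATISTICA workflow itself: fit a model on past customer behavior, score each customer's probability of churning, and rank customers for best-next-action outreach.

    import numpy as np
    from sklearn.ensemble import GradientBoostingClassifier

    rng = np.random.default_rng(42)
    n = 2000

    # Synthetic customer features: tenure (months), purchases last year, support tickets
    X = np.column_stack([
        rng.integers(1, 120, n),      # tenure
        rng.poisson(8, n),            # purchases
        rng.poisson(1.5, n),          # support tickets
    ])
    # Synthetic churn label: short tenure and many tickets raise churn odds
    logit = -1.5 - 0.02 * X[:, 0] - 0.1 * X[:, 1] + 0.8 * X[:, 2]
    y = rng.random(n) < 1 / (1 + np.exp(-logit))

    model = GradientBoostingClassifier().fit(X, y)

    # Propensity scores: probability of churn per customer; act on the riskiest first
    churn_propensity = model.predict_proba(X)[:, 1]
    top_risk = np.argsort(churn_propensity)[::-1][:10]
    print("highest-risk customers:", top_risk)
    print("their churn propensities:", churn_propensity[top_risk].round(3))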

STATISTICA Solution

  • Full Range of Solutions: Preparing data for analysis, feature selection, model selection, model evaluation and deployment. STATISTICA incorporates all these into one software package.
  • Highly Sophisticated Tools: Build Propensity Models or Customer Behavior Scoring Models to predict future behavior of customers. Highly sophisticated and robust tools are also available for performing Recency, Frequency and Monetary (RFM) Analysis of previous purchases to understand customer behavior and define market segments (see the sketch after this list).
  • Sequence and Association Analysis: Find and quantify associations between purchases and behaviors (e.g., what items are frequently purchased together, subsequently, etc.) with STATISTICA’s Sequence Association and Link Analysis tool. Uncover these valuable patterns to improve marketing strategies.
  • Enterprise-Wide Convenience: Build models in one department, test models in another department, and then start scoring in offices worldwide. STATISTICA Enterprise™ is a truly collaborative tool that leverages the full power of the best  propensity models available.
  • Reflexive Models for RealTime Needs: Live Score® processes new transactions as they happen and updates propensity models in rapid turn-around times made possible only by STATISTICA’s integrated solutions.
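
Below is the RFM sketch referenced above: a minimal Python/pandas illustration of Recency, Frequency, and Monetary scoring on made-up transactions, not STATISTICA's own RFM tooling.

    import pandas as pd

    # Made-up transaction history
    tx = pd.DataFrame({
        "customer": ["A", "A", "B", "C", "C", "C"],
        "date": pd.to_datetime(["2013-09-01", "2013-10-10", "2013-06-15",
                                "2013-10-01", "2013-10-05", "2013-10-20"]),
        "amount": [120.0, 80.0, 45.0, 200.0, 60.0, 90.0],
    })
    as_of = pd.Timestamp("2013-10-31")

    rfm = tx.groupby("customer").agg(
        recency=("date", lambda d: (as_of - d.max()).days),   # days since last purchase
        frequency=("date", "count"),                           # number of purchases
        monetary=("amount", "sum"),                             # total spend
    )

    # Simple 1-3 scores per dimension (lower recency is better, so reverse the labels)
    rfm["r_score"] = pd.cut(rfm["recency"], 3, labels=[3, 2, 1]).astype(int)
    rfm["f_score"] = pd.cut(rfm["frequency"], 3, labels=[1, 2, 3]).astype(int)
    rfm["m_score"] = pd.cut(rfm["monetary"], 3, labels=[1, 2, 3]).astype(int)
    print(rfm)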

World’s Largest Data Miner Survey: STATISTICA Remains #1 in User Satisfaction and Primary Tool

Rexer 2013 chart: tools ranked by satisfaction

Highlights from the 2013 Data Miner Survey were released by Rexer Analytics at the 2013 Predictive Analytics World conference on September 30. Complete results will be released later this year.

For the fourth time in a row, StatSoft’s flagship platform, STATISTICA, received both the highest rating in overall user satisfaction and the highest selection as “primary” data mining tool among commercial data mining packages.

Dr. Karl Rexer presented his summary findings with charts showing that 95% of STATISTICA users reported being “satisfied” or “extremely satisfied” with STATISTICA. Of these respondents, about two-thirds had applied the “extremely satisfied” ranking to STATISTICA, a ratio considerably greater than that earned even by the second-place tool. In fact, STATISTICA’s “extremely satisfied” ratio was more than double that of most commercial tools in the survey.

 

Significantly, no data miners recorded any level of “dissatisfaction” or “extreme dissatisfaction” with the STATISTICA tool.

STATISTICA’s selection as primary commercial data mining tool overall was determined by those respondents with access to multiple data mining tools who most often select STATISTICA as their tool of choice. Behind this top ranking were some strong showings within various user subgroups. STATISTICA was identified by the highest percentage of corporate data miners as their primary data mining tool, selected over any other commercial or GPL-licensed package. It was also identified as the primary commercial tool among academic data miners. Among non-profit, NGO, and government data miners, STATISTICA tied with SAS and IBM SPSS as the most-often-identified primary commercial tool.

Rexer 2013 highlights, primary tools

Conducted regularly since 2007, the Rexer Analytics Survey is the largest of its kind, surveying data miners from 75 countries during the spring of 2013. The survey’s questions covered analytic techniques and tools used in data mining practice, types of data analyzed, and challenges encountered.

A full report of the survey will be made available from Rexer Analytics.

STATISTICA Reduces Emergency Room Medical Costs: Study

Alekseiev and Harris, IHI residents
StatSoft, Inc., announces that the medical and financial value of its STATISTICA predictive analytics solution was recently confirmed yet again in a cost-savings study conducted by doctoral residents affiliated with St. John Health System.
The study, “Detection of Stress Induced Ischemia in Patients with Chest Pain After ‘Rule-out ACS’ Protocol,” was conducted by third-year resident Sergii Alekseiev, MD, and Chief Resident Ambria Harris, DO, under the auspices of Gary Miner, PhD, and Linda Miner, PhD.
The aim of their study was to identify the impact of improved diagnostic accuracy achieved through STATISTICA’s predictive analytics solutions.
To accomplish this, Alekseiev and Harris assessed the current practice of stress imaging in emergency room patients admitted with chest pain after “rule-out ACS” protocol. This was done by developing an advanced predictive model using STATISTICA Data Miner that accurately predicts the outcome of  further testing to determine the likelihood that such patients are at risk for cardiac events. Conventional practice requires further expensive testing and hospitalization while patients wait for test results. But, according to Alekseiev and Harris’ conclusions, the results of the STATISTICA model suggest “that a substantial subset of patients with chest pain and negative ACS workup can be safely discharged without performing MPS and subsequent outpatient follow-up,” thus improving patient care and substantially reducing medical costs.
Alekseiev and Harris presented their research findings to an international audience at the annual Scientific Assembly and Graduation of the In His Image (IHI) Family Medicine Residency program of Tulsa, held in June 2013 at Sequoia Lodge in Oklahoma.
Their study will be extended by the current group of second-year IHI residents, who will also pursue development of a smart phone app that emergency room doctors can use to submit predictor variables to the STATISTICA model.
Full results of this study will be included as a guest tutorial in the upcoming book, Practical Predictive Analytics and Decisioning Systems for Medicine: Informatics Accuracy and Cost-Effectiveness for Healthcare Administration and Delivery Including Medical Research, by Dr. Linda Miner et al. (Elsevier, 2014).
About StatSoft, Inc.
StatSoft was founded in 1984 and is now one of the world’s largest providers of analytics software, with 30 offices around the globe and more than one million users of its STATISTICA software. StatSoft’s solutions enjoy an extremely high level of user satisfaction across industries, as demonstrated in the unprecedented record of top ratings in practically all published reviews and large, independent surveys of analytics users worldwide. With its comprehensive suite of STATISTICA solutions for a wide variety of industries, StatSoft is a trusted partner of the world’s largest organizations and businesses (including most of the Fortune 500 companies), providing mission-critical applications that help them increase productivity, control risk, reduce waste, streamline operations, achieve regulatory compliance, and protect the environment.

STATISTICA Wins International CIAC Credit Scoring Competition

The BRICS-CCI & CBIC 2013 Congress in Brazil recently published results of its first data mining algorithm competition, announcing that StatSoft’s solution achieved first-place and second-place wins for STATISTICA Data Miner.

The international Computational Intelligence Algorithm Competition (CIAC), co-organized by NeuroTech S.A., focused for the first time on data mining applied to credit scoring. Participants were required to address the effects of temporal degradation of performance and seasonality of payment delinquency. Operating under the “Team Sandvika” name, StatSoft Norway’s Country Manager Knut Opdal and Rikard Bohm submitted a solution that demonstrated the superiority of modern data mining methods over the more traditional logistic regression method.

Their STATISTICA model placed first when judged for fitting of estimated vs. actual delinquency of approved credit applications, and it placed second when judged for robustness against performance degradation over a multi-year data set.

“We used this competition as an opportunity to compare the performance of different classes of predictive models,” Opdal and Bohm stated. “In particular we wanted to compare the industry standard method (logistic regression) with Boosting Trees, Random Forests, MARSplines and neural networks.”

Screenshot from the Opdal/Bohm CIAC credit scoring white paper

Open to participants from academia and industry, the competition consisted of two tasks. Task 1, which focused on performance degradation, was evaluated based on the usual area under the Receiver Operating Characteristic (ROC) curve metric. Task 2 focused on fitting of estimated vs. actual delinquency of those credit applications approved by the Task 1 model. CIAC organizers claimed this second task represents a realistic innovation in data mining competitions worldwide by emphasizing the relevance of the quality of future delinquency estimation instead of the usual lowest future average delinquency.
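
To give a rough sense of the kind of benchmarking described here, the following is a minimal Python sketch using scikit-learn and synthetic data. It is not the competition code or the CIAC data set, just the general pattern of comparing a traditional logistic regression with a boosted-tree model on held-out data using ROC AUC:

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.ensemble import GradientBoostingClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import roc_auc_score
    from sklearn.model_selection import train_test_split

    # Synthetic "credit application" data with an imbalanced outcome
    X, y = make_classification(n_samples=5000, n_features=12, n_informative=6,
                               weights=[0.85], random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

    models = {
        "logistic regression (scorecard-style)": LogisticRegression(max_iter=1000),
        "boosted trees": GradientBoostingClassifier(),
    }
    for name, model in models.items():
        model.fit(X_train, y_train)
        auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
        print(f"{name}: ROC AUC = {auc:.3f}")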

As a top performer in these two tasks, StatSoft Norway was invited to present its winning solution in a special CIAC track scheduled during the three-day BRICS-CCI & CBIC Congress in September. Opdal arranged via Skype to deliver the white paper, “Benchmarking of Different Classes of Models Used for Credit Scoring,” in which he described the methodology and techniques he and Bohm applied.

In the presentation Opdal and Bohm conclude that, as volume of data and/or variables increases, score model performance using modern data mining techniques (e.g., Boosting Trees, MARSplines, Neural Network) is “significantly better than (with) traditional scorecards.”

Competition results are listed here.

Watch Knut’s oral presentation of STATISTICA’s winning solution here.

About BRICS-CCI & CBIC

BRICS is an acronym for the economic group of Brazil, Russia, India, China, and South Africa, which held its first Congress on Computational Intelligence (BRICS-CCI 2013) during September, alongside the 11th biennial Brazilian Congress on Computational Intelligence (CBIC). The Congress’ website describes the objective of BRICS-CCI 2013 as providing a high-level international forum for scientists, researchers, engineers, and educators to disseminate their latest research results and exchange views of future research directions.

Exclusive: Cognitive Mining, Data Mining, and StatSoft – Interview with Dr. Thomas Hill

Source: http://www.kdnuggets.com 

What is the relationship between Cognitive Mining and Data Mining? I discuss this, along with what makes StatSoft different, achieving user satisfaction, and Big Data and privacy, with StatSoft VP Dr. Thomas Hill.

By Gregory Piatetsky, Oct 14, 2013.

What is the relationship between Cognitive Mining and Data Mining?

The landmark research by StatSoft CEO Paul Lewicki and his co-author Thomas Hill, VP Analytic Solutions at StatSoft, proved that the connection is very deep and important.

According to Wikipedia, Lewicki and Hill showed that advanced expertise acquired by humans via experience involves the acquisition and use of patterns that can be more complex than what humans can verbalize or intuitively experience. Frequently such patterns involve high-order interactions between multiple variables, while human consciousness usually can handle only first- and second-order interactions.

  Dr. Thomas Hill is a VP Analytic Solutions at StatSoft Inc., where he worked for over 20 years on development of data analysis, data and text mining algorithms, and the delivery of analytic solutions. He was a professor at the U. of Tulsa from 1984 to 2009, where he taught data analysis and data mining courses. Dr. Hill has received numerous academic grants and awards from NSF, NIH, the Center for Innovation Management, the Electric Power Research Institute, and other institutions.

Here is my interview with Dr. Hill.

Gregory Piatetsky, Q1: Your landmark research with Paul Lewicki [and Maria Czyzewska] on “Nonconscious social information processing” showed that humans can acquire complex advanced expertise that they cannot verbalize. This suggests a limitation of expert-hypothesis-driven data analysis methods, because they rely on testing hypotheses that have to be explicitly formulated by researchers.
What are the broad implications for data mining and data science?

Thomas Hill: Lewicki and others (including some research published by Thomas Hill) have demonstrated, over a wide range of human experiences and expertise, that exposure to complex and rich stimuli, consisting of large numbers of sensory inputs and high-order interactions between the presence or absence of specific features, will stimulate the acquisition of complex procedural knowledge without the learners’ conscious awareness. Hence the acquisition of such knowledge is best characterized as non-conscious information acquisition and processing.

Nonconscious Learning of Covariations

For example, when humans look at sequences of abstract pictures, faces, or tracking targets over seemingly random locations on the screen, carefully calibrated measures of procedural knowledge (e.g., based on response times) will reflect the acquisition of knowledge about complex covariations and rules inferred from the rich and complex stimuli.

The conclusions from this research are highly relevant for understanding how large amounts of high-dimensional information, consisting of complex interactions between numerous parameters, can be derived efficiently through systematic exposure to relevant stimuli and exemplars. Specifically:

  • It appears that knowledge about complex interactions and relationships in rich stimuli is the result of the repeated application of simple covariation-learning algorithms that detect co-occurrences between certain stimuli and combine them into complex interactions and knowledge
  • In human experts, most of this knowledge is procedural in nature, not declarative; in short, experienced experts can be effective and efficient decision makers but are poor at verbalizing how those decisions were made
  • When the covariations and repeated patterns in the rich stimulus field change, so that previously acquired procedural knowledge is no longer applicable, experts are slow to recognize this, and are often confused and reluctant to let go of “old habits”

Human expertise and effective decision making can be remarkable in many ways:

  • It is capable of leveraging “big data,” i.e., is remarkably capable with respect to the amount of information and stored knowledge that is used.
  • It is capable of coping with high-velocity data, i.e., it is very fast, with respect to the speed with which information is synthesized into effective, accurate decisions.
  • It is very efficient, with respect to how little energy our brain requires to process vast amounts of information and make near-instant decisions.

From the perspective of analytic approaches, these capabilities are accomplished through the repeated application of simple learning algorithms to rich and complex stimuli to identify repeated patterns that allow for accurate expectations and predictions regarding future events and outcomes.

It seems that big-data analytics is converging on this approach as well: applying large numbers of models, based on general approximators applied to relevant, diverse exemplars, is in most cases the best recipe for extracting complex information from data.

GP, Q2: Your findings reminded me of Leo Breiman’s famous 2001 paper “Statistical Modeling: The Two Cultures” (Statistical Science 16:3), where he writes

“There are two cultures in the use of statistical modeling to reach conclusions from data. One assumes that the data are generated by a given stochastic data model. The other uses algorithmic models and treats the data mechanism as unknown.”

Leo Breiman put himself in the second culture which he described (in 2001) as a small minority of researchers. Do your findings support the second, algorithmic, data-driven culture of data analysis, and if so how?

TH: See also the response to Q1. Obviously, there are and always will be applications for statistical hypothesis testing and modeling. In particular in science, it remains critical that evidence for theories and theoretical understanding of reality is advanced by testing a-priori hypotheses derived from theories, or by refining a-priori expectations.

There are also applications where this approach is critical: Recall that human “experts” (with highly evolved procedural knowledge in some domain) are usually not good at responding and understanding when old rules no longer apply.

If the mechanisms generating data are not understood (e.g., why a drug is effective), it can easily happen that something changes that renders old findings no longer predictive of future outcomes. In medicine, such errors can be critical.

GP, Q3: How did your research in cognitive psychology influence STATISTICA?

TH: Most importantly, it has driven the roadmap with respect to what algorithms we embraced and refined. For example, boosting of simple learners against repeated samples of diverse exemplars (e.g., stochastic gradient boosting) is one of the algorithms that in our minds “mimics” in many ways the way that humans acquire procedural learning.
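
A minimal sketch of that idea (my own illustration in Python with scikit-learn, not StatSoft's implementation): stochastic gradient boosting fits many very simple trees, each on a random subsample of the exemplars, and accumulates them into a complex predictor.

    from sklearn.datasets import make_friedman1
    from sklearn.ensemble import GradientBoostingRegressor

    # A problem with interactions that no single shallow tree captures on its own
    X, y = make_friedman1(n_samples=2000, noise=0.5, random_state=0)

    # "Stochastic" gradient boosting: subsample < 1.0 means each weak learner
    # sees only a random fraction of the exemplars, like repeated exposure
    model = GradientBoostingRegressor(
        n_estimators=300, max_depth=2,   # many very simple learners
        subsample=0.5, learning_rate=0.05,
        random_state=0,
    )
    model.fit(X, y)
    print("R^2 on training data:", round(model.score(X, y), 3))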

GP, Q4: StatSoft took first place in user satisfaction in the 2013 Rexer Analytics Survey (followed by KNIME, SAS JMP, IBM SPSS Modeler, and RapidMiner) and had high satisfaction in other user surveys. Who are your typical users and how do you achieve such satisfaction?

TH: We have always maintained a very disciplined approach to logging and “digesting” customer feedback. As a result, in many ways we may well have the simplest point-and-click interfaces to build even complex models.

The other big factor, in our experience, is the fact that STATISTICA is very easy to integrate into existing IT assets, regardless of whether they depend on Excel or text files, or on distributed file systems and web-based data services and schedulers. One way to look at our platform is as a development platform that is highly compliant with standard interfaces, programming and scripting languages, and so on. We know for sure that this makes deployments of our platform at our larger Enterprise clients much easier and more cost-effective: in many ways, STATISTICA will simply be just another (Windows) service running against existing standard database tables that store all data and metadata. So no new IT skills are required.

In practice, projects can fail when a platform does not integrate, or does not integrate easily, with what is already there, or fails to enable practitioners and non-data-scientists to do useful work quickly. STATISTICA is very good at that.

GP, Q5: How would you compare StatSoft STATISTICA Data Miner with other similar products? How do you compete with enterprise products like SAS and IBM SPSS Modeler on one hand, and free, open source software like R, KNIME, or RapidMiner on the other hand? What are some new exciting features you are planning to add?

TH: Regarding our competitive advantages over products from SAS and IBM, they are, of course, tough competitors, and we understand that we will win customers only if we outperform our competitors in the areas that are most relevant to the users.

Needless to say, we are working hard to achieve that goal and in the last two years have made significant progress as indicated by market share. Where exactly are our specific strengths in relation to products from these two competitors? I would prefer users (who are the most impartial judges) to answer these questions for you…

Regarding R and other open source software: we certainly do NOT consider them to be our competitors but, rather, most welcome allies who help proliferate the use of advanced analytics, in addition to making significant contributions to the science that we all rely on.

StatSoft was one of the first commercial software companies to fully embrace (in the sense of supporting) R, by incorporating a seamless integration between R and our platform. Also, to the best of our knowledge, we are the only one among the major data mining companies that has contributed to R by enhancing its functionality (i.e., by releasing functionality to the R community under unrestricted GPL licensing).

On the other hand, StatSoft’s customers depend on us for our analytics systems, platforms, and solutions that are validated, meticulously tested, follow carefully controlled software life-cycle management procedures, and are developed in close collaboration with end-users in the respective industries to meet their detailed requirements. The open-source world has been and continues to be a wonderful “Wikipedia of statistics and analytics” – a dynamic forum of ideas, new algorithms, methods, technologies.

Commercial software for mission-critical applications requires stringent software development procedures, software lifecycle management, validation, test cases, requirements documents, and so on. For example, in medical device and pharmaceutical manufacturing, analytics have to be validated, documented, and then “locked down.” This means features such as version control of analytic recipes, audit logs, and approval processes are all critical.

In our opinion, open-source code will continue to grow and provide important new ideas. At the same time, commercial and/or mission critical applications will also continue to rely on STATISTICA for its functionality that continues to be developed in direct response to real-life use cases and to the endless lists of requirements that are dictated by constant interactions with the customers who use our software for mission critical applications.

Also, unlike open source software, which delivers immensely valuable ideas and implementations but is less disciplined in its product lifecycle management, the STATISTICA software is strictly validated in a highly disciplined environment, following product life cycle management that adheres to SOPs and, for example, maintains backwards compatibility with previous versions. So you will never encounter a situation where some “new and improved” version of STATISTICA will break the previous implementation of our technology at customer sites. Also, STATISTICA software is entirely free of the restrictions that some of the open-source tools and algorithms place on commercial use (restrictions we respect and honor).

Regarding the roadmap and “exciting new features”: without giving away the “punch line,” suffice it to say that one of the opportunities of big data is to build, manage, and maintain large numbers of models. Again, this is something we have seen for a while in manufacturing (thousands of parameters recorded second-by-second to describe very complex processes). This means that a challenge for big-data analytics is to automate model building itself, enable effective model-sentinels that know when to recalibrate models, and do so automatically. In short, the challenge is to enable fewer data analysts and scientists to manage more models (perhaps thousands per analyst), and to take full advantage of the data that are collected at ever increasing speed and volume. That is where a lot of our R&D has been going for a while.

GP, Q6: You have been a professor at U. of Tulsa for 25 years. How did you combine research at U. of Tulsa with work at StatSoft and what eventually caused you to leave university and work for StatSoft full-time?

TH: We never really combined the two. I left The University of Tulsa in the late nineties (and after ten years). That was an exciting time when many of the algorithms and approaches commonly applied today started to emerge. I wanted to play a role in this emerging technology, based on my understanding of at least some of the basic mechanisms responsible for the incredible data processing capabilities of the human mind.

GP: Here is the second part of the interview with Dr. Thomas Hill on Cognitive Mining, Data Mining, and StatSoft.

How to Customize Axis Labels to Show Logical Date Intervals

Visualizing data by date is a great way to explore trends. However, because time intervals are not fixed (e.g., a month can be from 28 to 31 days), a fixed axis scaling may not give the most logical axis representation of dates.
Suppose you have data that is collected daily but you want to label the axis in weekly, monthly, quarterly, or yearly intervals. With a new feature in STATISTICA 12, you can easily control the label of the axes in a way that is relevant and makes sense for time.
The data set, Retail.sta, is one of the example data sets included with STATISTICA. To open the data set, select the File tab. In the left pane, select Open Examples. In the Open a STATISTICA Data File dialog box, double-click on the Datasets folder, select the Retail.sta file, and click the Open button.
Opening the Retail.sta dataset
Select the Graphs tab. In the Common group, click Scatterplot to display the 2D Scatterplots Startup Panel. Click the Variables button to display the Select Variables for Scatterplot dialog box, and select Date as X and Sales as Y.
Dialog box for selecting Scatterplot variables
Click the OK button.
In the 2D Scatterplots Startup Panel, click OK. The resulting graph will be a scatterplot with default date intervals on the x-axis.
Graph with default data intervals on x-axis
The x-axis labels are every 500 units (days in this example) by default. The default units would be appropriate in many situations, but with respect to time, months or years would be more logical.
Now, to customize the date interval on the x-axis, double-click in the graph background to display the Graph Options dialog box.
Under the Axis heading, select Major Units. On the Mode drop-down list, select Manual. Then, select the Date/time step check box to generate a date/time step in the plot. Finally, on the Unit drop-down list, select Year. Note that the Date/time step option is available only when the Mode is set to Manual. Also, note that the step Size is 1 by default and can be adjusted as desired, and Unit can be set to Year, Month, Week, Day, Hour, or Second.
Customizing with Graph Options dialog box
Click OK to apply the changes to the graph that show the x-axis in yearly intervals as shown below.
Graph with forced yearly data intervals on x-axis
While this graph shows the same information, the x-axis labeling is clearly better for showing yearly trends.
You can also change the labeling on the x-axis, that is, you can start labeling from January (or any other month) instead of December. To change the labeling on the x-axis, double-click in the graph background to display the Graph Options dialog box.
Under the Axis heading, select Scaling. Change the Mode to Manual, and select January-1952 for the Minimum and January-1966 for the Maximum.
More customizing with Graph Options dialog
Now, select the Major Units tab, and on the Step Count drop-down list select From Minimum.
Customizing Step Count with Graph Options dialog box
Click OK to assign these changes to the graph. The x-axis is labeled from January as shown below.
Graph with forced January-Year intervals on x-axis
This graph shows the same information, but the x-axis has been modified to adjust for the first month of the year.
Thus, with the Date/Time step option, you can easily customize the axis labels to show appropriate dates.
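
For readers who want the same effect outside STATISTICA, the equivalent idea, yearly tick marks anchored to January, can be sketched in Python with matplotlib. This is an analogy to the steps above using made-up daily data, not part of the STATISTICA product:

    import matplotlib.pyplot as plt
    import matplotlib.dates as mdates
    import numpy as np
    import pandas as pd

    # Made-up daily sales series standing in for Retail.sta
    dates = pd.date_range("1952-01-01", "1966-01-01", freq="D")
    sales = np.cumsum(np.random.default_rng(0).normal(0.1, 1.0, len(dates)))

    fig, ax = plt.subplots()
    ax.scatter(dates, sales, s=2)

    # Label the x-axis at yearly intervals, starting at January of each year
    ax.xaxis.set_major_locator(mdates.YearLocator(base=1, month=1, day=1))
    ax.xaxis.set_major_formatter(mdates.DateFormatter("%b-%Y"))
    plt.show()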