Monthly Archives: May 2013
Plotting data across time helps to reveal interesting patterns and relationships. This was true of a study of weather and temperature patterns in Illinois that was conducted by Carl von Ende from the Department of Biological Sciences at Northern Illinois University.
The goal was to visually compare the trends in temperature between two years, 2008 and 2009. The data, available from the National Weather Service Forecast Office of Central Illinois, gives temperatures (measured in Fahrenheit) and the dates when the measurements were taken.
In this article, we will explore the steps needed to create such a plot, including creating new variables, using date/time functions, and applying graph customization tools.
The data used to create the graph is presented above. As you can see, there is only one date variable, and we need to obtain from this variable the year for categorization as well as the day of the year for plotting along an axis. Therefore, the first step is creating two variables with this data.
In STATISTICA, select the Data tab. In the Variables group, click the Variables arrow, and from the menu, select Add. In the Add Variables dialog box, add two variables after Temperature (in F).
After you click OK, two new variables are added to the data file, NewVar1 and NewVar2.
Double-click on the variable header for NewVar1 to display the variable specifications dialog box. Change the variable name to Day_of_Year. In the Long name field, enter a function that returns a numerical code for the day of the year. You can see this function and the required parameters by clicking the Functions button, which displays the Function Browser. In the Category list, click on Date/Time, and in the Item list, scroll down to and click on DTDAYOFYEAR.
This is the function that will be used to return a numerical code for the day of the year. As you can see from its description, it returns a numerical code between 1 and 366 that represents the day of the year. Close the Function Browser and return to the variable specifications dialog box. In the Long name field, type =DTDayOfYear('Date').
When you click OK, a message will be displayed, letting you know whether the expression is correct. If the expression is correct, click Yes and the variable will be renamed and numerical codes for the day of the year will be included in the cases for the variable.
Now, double-click on the variable header for NewVar2, rename the variable Year, and in the Long name field, type =DTYear('Date'). This will rename the variable and add a four-digit number for the year in each case of the data set.
You will now have the complete data set with two additional variables, Day_of_Year and Year. These variables will be used in the results graph.
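For readers working outside STATISTICA, the same two derived variables can be sketched with pandas. This is a minimal illustration, with made-up sample dates and temperatures, of what the DTDAYOFYEAR and DTYEAR spreadsheet functions compute:

```python
import pandas as pd

# Hypothetical stand-in for the weather data: one date column and one
# temperature column, as in the original spreadsheet.
df = pd.DataFrame({
    "Date": pd.to_datetime(["2008-01-15", "2008-07-04",
                            "2009-01-15", "2009-07-04"]),
    "Temperature (in F)": [28.0, 85.0, 22.0, 88.0],
})

# Analogue of =DTDayOfYear('Date'): a numerical code between 1 and 366.
df["Day_of_Year"] = df["Date"].dt.dayofyear

# Analogue of =DTYear('Date'): a four-digit year for categorization.
df["Year"] = df["Date"].dt.year
```

Note that leap years matter here: July 4 is day 186 in 2008 but day 185 in 2009, which is exactly why a day-of-year code (rather than the raw date) is needed to overlay the two years on one axis.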
To create the graph, select the Graphs tab. In the Common group, click Scatterplot.
STATISTICA is designed so that when creating a 2D Scatterplot, the most common options for creating a scatterplot are shown on the Quick tab, as shown below. On the Quick tab of the 2D Scatterplots dialog box, click the Variables button. Select Day_of_Year as the x-axis variable and Temperature (in F) as the y-axis variable. Click OK in the variable selection dialog box. Under Fit type, clear the Linear check box.
On the Categorized tab, in the X-Categories group box, select the On check box. Click the Change Variable button, and select Year as the categorization variable. Click OK. In the Layout group box, select the Overlaid option button.
Click OK to create the scatterplot graph.
We still need to convert the numerical codes to date format and to connect the data points by lines. To do this, double-click in the graph background to display the Graph Options dialog box.
Select the General tab for Plot, and select the Multiple lines check box.
Select the Scale Values tab for Axis, and ensure that the X Axis is specified. Then, for the Value format, select Date and the option for 17-Mar. Finally, under Options in the Layout drop-down list, select Perpendicular.
Click OK to create a graph that shows the data for Day_of_Year categorized by Year for Temperature (in F), as shown below.
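The overlaid, X-categorized layout can also be reproduced outside STATISTICA. Here is a matplotlib sketch, again using made-up sample data, that mirrors the same steps: group by Year, plot Temperature against Day_of_Year, and connect the points with lines:

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical sample of the weather data (Date, Temperature in F).
df = pd.DataFrame({
    "Date": pd.to_datetime(["2008-01-15", "2008-07-04",
                            "2009-01-15", "2009-07-04"]),
    "Temperature (in F)": [28.0, 85.0, 22.0, 88.0],
})
df["Day_of_Year"] = df["Date"].dt.dayofyear
df["Year"] = df["Date"].dt.year

fig, ax = plt.subplots()
# Overlaid layout: one connected line of points per year.
for year, grp in df.groupby("Year"):
    ax.plot(grp["Day_of_Year"], grp["Temperature (in F)"],
            marker="o", label=str(year))
ax.set_xlabel("Day of Year")
ax.set_ylabel("Temperature (in F)")
ax.legend(title="Year")
```

Formatting the x-axis ticks as dates (the 17-Mar style chosen in the Graph Options dialog) would be an additional step with matplotlib's tick formatters.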
STATISTICA HP – Big Data Performance using Massively Parallel and In-Memory Processing
StatSoft announces STATISTICA HP, the latest release of Version 12 of the STATISTICA analytics platform, designed to leverage information contained in extremely large data sets using massively parallel and in-memory processing.
This technology allows StatSoft customers to bring supercomputing performance to their big data, leveraging the power of multiprocessor servers that are rapidly becoming more affordable and widely available as part of the existing computer infrastructures of not only large but also midsize and even some small companies. For example, the familiar Microsoft Windows® server-based environment is now available with up to 640 logical processors (with Windows Server 2012; up to 256 logical processors with Windows Server 2008 R2).
“StatSoft has the distinction of being the only analytics and predictive modeling platform specifically optimized for Windows computing platforms,” says Dr. Thomas Hill, StatSoft’s VP for Analytic Solutions. “With the latest release of STATISTICA HP, we have achieved remarkable performance on practically all computational tasks, in particular for in-memory data processing on high-performance servers.”
To illustrate the remarkable performance of the STATISTICA analytic system, StatSoft has conducted performance tests on a midrange, 64-core server machine with 256GB of RAM.
Statistical Computations and Summaries
As discussed in detail in the StatSoft White Paper (The Big Data Revolution And How to Extract Value from Big Data), many of the use cases around big and high-velocity data involve data summarization, aggregation, and the identification of basic relationships.
Shown below is a screenshot of STATISTICA running against a data set with 1 million records and 1,000 fields, computing 1 million correlations.
The STATISTICA software successfully distributes the required computational load over all of the available CPUs, utilizing 100% of the hardware resources of the system. Computing 1 million correlations on a data set with 1 million records completes within seconds (depending on the clock speed and memory access architecture of the system).
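A scaled-down version of this kind of workload can be sketched with NumPy, which likewise computes all pairwise correlations in one pass; the data dimensions here are hypothetical, shrunk from the benchmark's 1,000,000 records by 1,000 fields so the sketch runs anywhere:

```python
import numpy as np

# Hypothetical scaled-down benchmark: random data, 10,000 records x 50 fields.
rng = np.random.default_rng(0)
data = rng.normal(size=(10_000, 50))

# np.corrcoef computes every pairwise correlation at once; the underlying
# matrix product is delegated to (typically multithreaded) BLAS, which is
# how this style of computation spreads across available cores.
corr = np.corrcoef(data, rowvar=False)  # 50 x 50 correlation matrix
```

With 1,000 fields instead of 50, the result would be a 1,000 x 1,000 matrix, i.e. the 1 million correlations cited in the benchmark.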
The Power of Parallel Processing for Predictive Modeling
The architecture of STATISTICA HP provides numerous optimizations that involve massive parallelization, both during the model building process as well as the scoring process.
For example, analytic workspaces such as the one shown below can be run on multicore servers where the competitive evaluation of multiple models is effectively performed in parallel across multiple cores, achieving 100% utilization of the computing resources of the system and yielding remarkable performance.
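The idea of competitively evaluating several candidate models in parallel can be sketched with the Python standard library. This is only an illustration of the pattern, not STATISTICA's implementation: the "models" here are simple polynomial fits of different degrees, and NumPy's compiled kernels release the GIL so the threads can genuinely overlap:

```python
from concurrent.futures import ThreadPoolExecutor
import numpy as np

# Hypothetical data with a quadratic signal plus noise.
rng = np.random.default_rng(1)
x = np.linspace(0.0, 1.0, 1_000)
y = 3 * x**2 - x + rng.normal(scale=0.05, size=x.size)

def evaluate(degree):
    """Fit one candidate model and return (degree, RMS residual error)."""
    coeffs = np.polyfit(x, y, degree)
    resid = y - np.polyval(coeffs, x)
    return degree, float(np.sqrt(np.mean(resid**2)))

# Competitive evaluation: all candidate models are fit concurrently.
with ThreadPoolExecutor() as pool:
    results = dict(pool.map(evaluate, [1, 2, 3, 4]))

best = min(results, key=results.get)  # degree with the lowest error
```

The same pattern scales to heavier model families and process pools; the point is that each candidate is independent, so the comparison parallelizes cleanly across cores.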
Building an effective tree-based classification model against 1 million records on the 64-core, 256GB RAM platform described earlier completes in seconds (depending on the clock speed and memory access architecture of the system).
Also, many of StatSoft’s customers are currently using the STATISTICA Enterprise Server™ platform to enable massively parallel model-scoring in virtual on-demand environments, again highlighting the flexibility and utility associated with STATISTICA’s adherence to and compatibility with modern software standards, interfaces, and emerging computing technologies.
In addition, in STATISTICA HP 12, all advanced modeling algorithms, including the most powerful ensemble methods such as random forests, gradient boosted trees, and others, are implemented to take advantage of large numbers of cores and available RAM for efficient in-memory model building against big data.
Computing platforms with large numbers of CPUs and cores and capabilities to handle huge data files via in-memory processing are rapidly becoming less expensive and more common not only in science but also in business use. Too often, however, the bottleneck is the (analytic) software which limits the performance that can be achieved with such hardware. According to George Butler, StatSoft’s VP for Platform Development, “StatSoft has accumulated significant expertise over decades on how to optimize the performance of analytic software, and the STATISTICA HP platform will fully take advantage of Microsoft’s newest server platforms supporting hundreds of cores.” He continues, “We are a close Microsoft Partner but also an Intel® Software Premier Elite Partner, and our R&D is constantly looking for new and better ways to leverage existing hardware and operating system resources.”
by STATISTICA Press Releases
In your statistical analysis adventures with STATISTICA, you may run into some error messages along the way. One such type of error tells you that a variable does not have enough codes. Let’s explore more of what this error is telling us and how to fix the data.
In statistical modeling tools like General Linear Models and data mining tree algorithms like C&RT, CHAID, Boosted Trees, and Random Forests, when a categorical predictor variable has fewer than 2 distinct levels, you get an error message that says: "Not enough codes selected for variable : The required minimum number of codes is 2."
For Neural Networks and machine learning tools, you get a somewhat wordier message telling you the same thing. It says: “STATISTICA has detected an insufficient number of nominal levels in the categorical independent variable . All categorical variables must contain at least two nominal values. Please carefully check your data and case selection conditions and try again.”
Here are some possible causes and solutions for this type of error message. I reserve the right to add to my list, if I think of something else.
- All records in the sample have the same response for this variable. Say the population you are studying is male subjects. You don’t need a gender variable in the model because all participants in your study are male. This gender variable should not be (and can’t be) used for analysis.
- Records are missing from our data set. Say we want to compare several hospitals on various metrics, and each hospital has its own data set. These data files need to be merged before starting the analysis.
- Data was recorded as 1 if present and left blank otherwise. When this is the case, those blanks should be filled in with a value. When STATISTICA encounters a blank cell, it is treated as unknown (missing). In this case, the value is not unknown; it is a distinct group level. To fix this, you might use the Process Missing Data tool to replace missing cells with a value, or a spreadsheet function like “=iif(ismd(vcur), 0, vcur)”, which leaves existing data unchanged and fills in missing cells with 0.
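The same recode can be sketched in pandas; this is a hypothetical indicator variable, not data from the post, and `fillna(0)` plays the role of the iif/ismd spreadsheet formula:

```python
import pandas as pd
import numpy as np

# Hypothetical indicator recorded as 1 when present, blank otherwise.
# The blanks are really a second group level, not missing data.
flag = pd.Series([1.0, np.nan, 1.0, np.nan, np.nan])

# Leave existing values unchanged; fill blank cells with 0,
# analogous to =iif(ismd(vcur), 0, vcur).
flag_filled = flag.fillna(0)
```

After the recode the variable has the two distinct levels (0 and 1) that the modeling tools require.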
- Case selection conditions have excluded some cases, and now there is only 1 unique level of this variable. To solve this issue, either the case selection conditions need to be changed or the predictor variable should not be included in the analysis.
- Missing data in another variable has excluded some cases, leaving only one unique level of a grouping factor. Missing data should be dealt with before starting an analysis; model-building tools can't work with missing data, so those records are excluded. Here is a video that discusses dealing with missing data.
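A quick pre-flight check for the last two causes can be sketched in pandas. The data and variable names here are hypothetical; the idea is to verify, after case selection and missing-data exclusion, that every categorical predictor still has at least two distinct levels:

```python
import pandas as pd

# Hypothetical data set: "gender" has one level (all male subjects),
# "smoker" has two levels once the incomplete record is dropped.
df = pd.DataFrame({
    "gender":  ["M", "M", "M", "M"],
    "smoker":  ["Y", "N", "Y", None],
    "outcome": [1, 0, 1, 0],
})
predictors = ["gender", "smoker"]

# Mimic what model-building tools do: exclude records with missing data.
complete = df.dropna(subset=predictors + ["outcome"])

# Flag any predictor left with fewer than 2 distinct levels --
# the condition that triggers the "not enough codes" error.
too_few = [v for v in predictors if complete[v].nunique() < 2]
```

Running a check like this before the analysis tells you which variables to drop or which selection conditions to revisit, instead of discovering the problem via the error message.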
Now that you understand the cause of your error message, you are ready to get back to blazing your analytic trail.
Chemical and Petrochemical organizations are among the largest users of STATISTICA applications, benefiting from STATISTICA analytics both in Research & Development and Manufacturing.
Research & Development
One contributing factor in a chemical/petrochemical company’s success is the ability of the R&D scientists to discover and develop a product formulation with useful properties.
The STATISTICA platform results in hard and soft ROI by:
- Empowering scientists with the analytic and exploratory tools to make more sound decisions and gain greater insights from the precious data that they collect
- Saving the scientists’ time by integrating analytics in their core processes
- Saving the statisticians’ time to focus on the delivery and packaging of effective analytic tools within the STATISTICA framework
- Increasing the level of collaboration across the R&D organization by sharing study results, findings, and reports
STATISTICA provides a broad base of integrated statistical and graphical tools including:
- Tools for basic research such as Exploratory Graphical Analysis, Descriptive Statistics, t-tests, Analysis of Variance, General Linear Models, and Nonlinear Curve Fitting.
- Tools for more advanced analyses such as a variety of clustering, predictive modeling, classification, and machine learning approaches, including Principal Components Analysis.

The STATISTICA platform meets the needs of both scientists and statisticians in your R&D organization.
Chemical and Petrochemical organizations have deployed STATISTICA within their manufacturing processes in several ways:
- These organizations have arrived at a greater understanding of their process parameters and their relationship to product quality by applying STATISTICA‘s multivariate statistical process control (SPC) techniques. STATISTICA integrates with their process information repositories and LIMS systems to retrieve the data required to perform these analyses.
- These organizations have also utilized the deployment capabilities of STATISTICA's Data Mining algorithms to integrate advanced modeling techniques such as Neural Networks, Recursive Partitioning approaches (CHAID, C&RT, Boosted Trees), MARSplines, Independent Components Analysis, and Support Vector Machines. STATISTICA allows them to deploy a fully trained predictive model in Predictive Model Markup Language (PMML), C++, or Visual Basic for ongoing monitoring of a process. These models, once trained and evaluated on historical data, are deployed as “soft sensors” for the ongoing monitoring and control of process parameters.
by Win Noren
Health care and the costs surrounding it are laden with numbers. Each year we all see our insurance premiums go up, and sometimes our out-of-pocket costs go up, too, as co-pays to access medical care rise alongside the premiums. The debate around medical care is contentious and the misuse of statistics abounds, but that is a topic for a future post.
Recently a peer-reviewed article was published on PLOS Medicine which found a 25% lower relative risk of death through a specific treatment offered in a randomized trial to elderly Americans with at least one chronic illness. Even more amazing is that this treatment had no adverse side effects. And the treatment was cost effective. In fact, in the high-risk subgroup those receiving the treatment had 29% fewer hospitalizations and 20% lower expenditures than those who did not receive the treatment.
So what is this magic bullet and how can I get this treatment for the elderly in my life?
This study was the “Effect of a Community-Based Nursing Intervention on Mortality in Chronically Ill Older Adults: A Randomized Controlled Trial” and, simply put, it was the provision of a nurse who made weekly visits to the elderly patients in their homes. This study came to my attention through a lengthy but very interesting article in the Washington Post published in late April entitled, “If this was a pill, you’d do anything to get it.”
The author of the article is right: the results of this study are quite stunning. The results are exciting in part because it was a randomized trial where the participants were not “cherry picked” but, rather, were randomly assigned to either the treatment group that received the regular in-home nurse visits or to the control group that did not receive the in-home nurse visits. This means that we have a good expectation that these results will hold true for a bigger population instead of the results being a side-effect of putting all the healthiest subjects in the treatment group.
Of course, that IS the big question: Will such a treatment show a similar result in larger populations and in a more diverse population? Since the trial was conducted in a relatively small geographic area (Doylestown, PA), one questions whether the same results would hold true in a group of elderly that are more diverse in ethnicity and economic status.
This study, which was one of 15 conducted as part of a Medicare-funded “demonstration project,” was conducted by Health Quality Partners (HQP). Of the 15 projects only 4 showed improved patient outcomes without increasing costs, and only this study showed improved patient outcomes while decreasing costs. Unfortunately, it appears that Medicare will not be funding this study further. That decision does not seem to make any sense, given the exciting results of the project to date.
At the same time, perhaps there is encouraging news as Aetna has just announced that they are extending their relationship with HQP, as members who enrolled in this program continue “to have fewer hospital admissions and lower medical costs than members with similar conditions who did not participate.” So perhaps other insurance companies and maybe even Medicare will see a reason to join in such an effort.