Big Data, Too Much of Anything is Bad

[Header image: Star Trek, Kirk and tribbles]

There is an old saying: “Too much of anything is bad.” I guess it applies to data analysis as well.

The first time I worked on a real-world data mining problem, I was given a dataset with millions of cases and thousands of variables. I needed to predict a target variable from the other variables, identify which of those variables had a significant impact on the target, and rank them by significance.

I wondered how to select the input variables (predictors) for predicting the target variable, since using all of the thousands of input variables makes no sense. Variables known to be unnecessary, as well as those that add only a small amount of information, should be excluded from the analysis to overcome the curse of dimensionality. Pre-screening input variables can improve both the model building speed and the predictive accuracy of data mining models.

Also, every predictor used by a data mining model must be available in order to deploy that model to new cases. If good predictive accuracy is attainable with a smaller set of inputs, deployment is easier.

Thank you, STATISTICA Feature Selection and Variable Screening (FSL) module!

The STATISTICA FSL module acts as a pre-processor that selects the top predictors likely to be related to the outcome variable. It also ranks the predictors by significance, for both regression-type and classification-type problems. It uses both F-values and p-values as criteria for determining predictor importance.
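To make the idea of ranking predictors by an F-value concrete, here is a minimal sketch in plain Python. It is not the STATISTICA FSL algorithm, only an illustration of univariate screening for a regression-type problem: each predictor is scored by the F statistic of a one-predictor linear regression against the target, and the top few are kept. The function names (`f_statistic`, `rank_predictors`) and the synthetic data are assumptions made up for this example. Because every predictor here has the same number of cases, the F statistic and its p-value give the same ordering, so the sketch ranks by F alone.

```python
import math
import random


def f_statistic(x, y):
    """F statistic for a one-predictor linear regression of y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    if sxx == 0 or syy == 0:
        return 0.0  # a constant column carries no information
    r2 = (sxy * sxy) / (sxx * syy)  # squared Pearson correlation
    if r2 >= 1.0:
        return math.inf  # perfect linear relationship
    return r2 / (1.0 - r2) * (n - 2)


def rank_predictors(columns, target, top_k):
    """Return the top_k (name, F) pairs, largest F first."""
    scores = [(name, f_statistic(values, target))
              for name, values in columns.items()]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:top_k]


# Synthetic data (hypothetical): the target depends strongly on x1,
# weakly on x2, and not at all on noise.
rng = random.Random(0)
n_cases = 500
columns = {
    "x1": [rng.gauss(0, 1) for _ in range(n_cases)],
    "x2": [rng.gauss(0, 1) for _ in range(n_cases)],
    "noise": [rng.gauss(0, 1) for _ in range(n_cases)],
}
target = [
    3.0 * a + 0.5 * b + rng.gauss(0, 1)
    for a, b in zip(columns["x1"], columns["x2"])
]

top = rank_predictors(columns, target, top_k=2)
for name, f_value in top:
    print(f"{name}: F = {f_value:.1f}")
```

Screening one variable at a time like this is fast enough for thousands of columns, but it ignores interactions between predictors, so it is best treated as a first filter before model building rather than a final selection.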

Figure 1: Feature Selection and Variable Screening Dialog Box

Figure 2: Selecting top ten predictors

Figure 3: Predictor Importance

Figure 4: Predictor Importance Plot

Figure 5: Predictor Importance Report

Header photo courtesy of United Federation of Planets Deep Space Station K-7 (Alpha Quadrant), originally sourced to Paramount Pictures.


About statsoftsa

StatSoft, Inc. was founded in 1984 and is now one of the largest providers of analytic software worldwide. StatSoft is also the largest manufacturer of enterprise-wide quality control and improvement software systems in the world, and the only company capable of supporting its QC products worldwide, with wholly owned subsidiaries in all major markets (StatSoft has 23 full-service offices, on all continents). Its software is available in more than 10 languages.

Posted on December 14, 2012.
