Big Data: Too Much of Anything Is Bad

There is an old saying: “Too much of anything is bad.” I guess it applies to data analysis as well. :)

The first time I worked on a real-world data mining problem, I was given a dataset with millions of cases and thousands of variables. I was trying to predict a target variable from the other variables. I also wanted to find out which variables had a significant impact on the target variable and rank them by their significance.

I wondered how to select the input variables (predictors) for predicting the target variable, since using all of the thousands of input variables makes no sense. Variables known to be unnecessary, as well as those that add only a small amount of information, should be excluded from the analysis to mitigate the curse of dimensionality. Pre-screening input variables can improve both the model-building speed and the predictive accuracy of data mining models.

Also, every predictor used by a data mining model must be available when that model is deployed to new cases. If good predictive accuracy is attainable with a smaller set of inputs, deployment becomes easier.

Thanks to the STATISTICA Feature Selection and Variable Screening (FSL) module!

The STATISTICA FSL module acts as a pre-processor that selects a list of top predictors likely to be related to the outcome variable. Furthermore, it ranks the predictors by their significance for both regression and classification-type problems. It uses both F-values and p-values as criteria for determining predictor importance.
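To give a feel for what ranking predictors by an F-value means, here is a minimal Python sketch for the regression case. It is not STATISTICA's implementation. The function names (`f_statistic`, `rank_predictors`), the synthetic data, and the choice of a simple one-predictor linear regression for each variable are all assumptions made for illustration. Each predictor gets the F statistic of a univariate regression against the target, and the predictors are sorted by that score. With the same number of cases for every predictor, a larger F corresponds to a smaller p-value, so sorting by F gives the same order as sorting by p.

```python
import random


def f_statistic(x, y):
    """F statistic of a one-predictor linear regression of y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    if sxx == 0 or syy == 0:
        return 0.0  # a constant column carries no information
    r2 = (sxy * sxy) / (sxx * syy)  # squared Pearson correlation
    if r2 >= 1.0:
        return float("inf")  # perfect linear fit
    # F with 1 and n - 2 degrees of freedom
    return r2 / (1.0 - r2) * (n - 2)


def rank_predictors(columns, target, top_k=10):
    """Return the top_k (name, F) pairs, highest F first."""
    scores = {name: f_statistic(values, target)
              for name, values in columns.items()}
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return ranked[:top_k]


# Synthetic data (hypothetical): 20 predictors, only x3 and x7 drive the target.
rng = random.Random(0)
n_cases = 500
columns = {f"x{i}": [rng.gauss(0, 1) for _ in range(n_cases)]
           for i in range(20)}
target = [3.0 * columns["x3"][i] + 1.0 * columns["x7"][i] + rng.gauss(0, 1)
          for i in range(n_cases)]

top = rank_predictors(columns, target, top_k=5)
for name, f_value in top:
    print(f"{name}: F = {f_value:.1f}")
```

In this toy setup the two variables that actually drive the target should land at the top of the list, well ahead of the pure-noise columns. A screen like this looks at one predictor at a time, so it is a quick pre-processing step rather than a replacement for the final model.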

Figure 1: Feature Selection and Variable Screening Dialog Box

Figure 2: Selecting top ten predictors

Figure 3: Predictor Importance

Figure 4: Predictor Importance Plot

Figure 5: Predictor Importance Report
