Big Data, Too Much of Anything is Bad
There is an old saying: “Too much of anything is bad.” I guess it applies to data analysis as well.
The first time I worked on a real-world data mining problem, I was given a dataset with millions of cases and thousands of variables. I needed to predict a target variable from the other variables, find out which of those variables had a significant impact on the target, and rank them by their significance.
I wondered how to select the input variables (predictors) for predicting the target variable, since using all of the thousands of input variables makes no sense. Variables known to be unnecessary, as well as those that add only a small amount of information, should be excluded from the analysis to overcome the curse of dimensionality. Pre-screening input variables can improve both model-building speed and the predictive accuracy of data mining models.
Also, every predictor used by a data mining model must be available whenever that model is deployed to new cases. If good predictive accuracy is attainable with a smaller set of inputs, deployment is easier.
Thanks go to the STATISTICA Feature Selection and Variable Screening (FSL) module!
The STATISTICA FSL module acts as a pre-processor that selects the top predictors most likely to be related to the outcome variable. It also ranks the predictors by their significance, for regression as well as classification-type problems, using both F-values and p-values as criteria for predictor importance.
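To give a feel for what F-based screening does, here is a minimal sketch in plain Python for a classification-type problem. It is not the STATISTICA implementation, just one common way to do the same kind of ranking: compute a one-way ANOVA F statistic for each numeric predictor against the class label and keep the predictors with the largest values. The function names (`anova_f`, `rank_predictors`) and the toy columns are made up for illustration. Because every predictor here shares the same degrees of freedom, ranking by F gives the same order as ranking by p-value, so the p-value step is left out.

```python
from statistics import mean


def anova_f(values, labels):
    """One-way ANOVA F statistic of a numeric predictor against a class label."""
    # Split the predictor values into one group per class.
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(label, []).append(value)

    n, k = len(values), len(groups)
    if k < 2 or n <= k:
        raise ValueError("need at least two classes and more cases than classes")

    grand_mean = mean(values)
    # Variation of the class means around the overall mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups.values())
    # Variation of the cases around their own class mean.
    ss_within = sum((v - mean(g)) ** 2 for g in groups.values() for v in g)

    if ss_within == 0:
        # No spread inside the classes: either a perfect separator or a constant.
        return float("inf") if ss_between > 0 else 0.0
    return (ss_between / (k - 1)) / (ss_within / (n - k))


def rank_predictors(columns, labels, top=10):
    """Return the `top` (name, F) pairs, largest F first."""
    scores = {name: anova_f(values, labels) for name, values in columns.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)[:top]


# Toy data (hypothetical): six cases, two classes, three candidate predictors.
labels = ["a", "a", "a", "b", "b", "b"]
columns = {
    "strong": [1.0, 1.2, 0.9, 5.0, 5.1, 4.8],    # clearly differs by class
    "weak": [1.0, 5.0, 3.0, 1.1, 4.9, 3.1],      # almost no class difference
    "constant": [2.0, 2.0, 2.0, 2.0, 2.0, 2.0],  # carries no information
}

ranking = rank_predictors(columns, labels, top=2)
for name, f_value in ranking:
    print(f"{name}: F = {f_value:.3f}")
```

With thousands of columns the same loop applies unchanged; only the `top` cutoff (ten in Figure 2) decides how many predictors survive the screening.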
Figure 1: Feature Selection and Variable Screening Dialog Box
Figure 2: Selecting top ten predictors
Figure 3: Predictor Importance
Figure 4: Predictor Importance Plot
Figure 5: Predictor Importance Report
Header photo courtesy of United Federation of Planets Deep Space Station K-7 (Alpha Quadrant), originally sourced to Paramount Pictures.