Our previous How-To article, How to Deploy Models Using SVB Nodes, covered a topic that is becoming increasingly important in data mining applications: working, through a graphical user interface, with nodes that represent data mining algorithms. In that article, Rajiv Bhattarai covered model deployment using the original STATISTICA Visual Basic (SVB) nodes. As STATISTICA keeps pace with rapid advances in technology and invests heavily to remain a leader in predictive analytics, new nodes have been developed. These new nodes prompt many questions, so this article describes the differences between the scripted SVB nodes and the new STATISTICA Workspace nodes, and then shows how the new nodes make model deployment easier than ever.
New STATISTICA Workspace Nodes vs. Scripted Nodes
As you work with STATISTICA Workspaces, you will see two types of nodes in practice. The first is the scripted SVB node, described in the previous article; these nodes are indicated by SVB on their icons, as you will see below, and are not the focus of this article. The new nodes are introduced as enhancements of the workspace user interface and closely resemble the interactive user interface of the respective modules. Below is a comparison of the Boosted Trees Classification SVB node and the new Boosted Classification Trees node.
Describing all the additional features of the new nodes in detail is beyond the scope of this article, but here are some highlights that will help you distinguish the SVB nodes from the new ones. A few properties of the new nodes:
- Before the node is run, it appears with a yellow background. When the node is run, the background turns from yellow to clear, indicating that the analysis has completed.
- Additional functionality is represented by icons on the node:
- Nodes are run by clicking the green arrow icon located at the lower-left corner of the analysis node.
- Parameters can be edited by clicking the grey gear icon at the upper-left corner of the node.
- Node results can be viewed by clicking the report icon at the upper-right corner of the node.
- Downstream results are indicated by a document icon at the lower-right corner of the node.
- Nodes can be connected by clicking the gold diamond icon at the center-right side of the node, holding down the mouse button, and dragging an arrow to another node; release the button there to attach the two nodes.
- Variable selection can be performed on the analysis node.
- The functionality of the node closely resembles that of the respective interactive analysis. As the results options for Boosted Classification Trees above show, the results options alone give you much more control over the output produced when the analysis completes.
- Deployment functionality is built into the node.
Deployment Example with New Nodes
For this example, we will use historical credit data in which each customer is classified as either Good (satisfied the loan) or Bad (defaulted). Using both Logistic Regression and Boosted Trees, we will build and compare the performance of two models that predict Good or Bad credit for future applicants.
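Outside of STATISTICA, the same comparison can be sketched with scikit-learn stand-ins for the two analysis nodes. This is an illustrative analogy only: the data below is synthetic, and the column structure is hypothetical rather than taken from creditscoring.sta.

```python
# Illustrative sketch: comparing a logistic regression and a boosted-trees
# classifier on a binary Good/Bad credit task, using scikit-learn as a
# stand-in for the STATISTICA analysis nodes. Data is synthetic.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 1000
X = rng.normal(size=(n, 4))                                # four numeric predictors
y = X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=n) > 0       # True = "Bad" credit

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

models = {
    "Logistic Regression": LogisticRegression(),
    "Boosted Trees": GradientBoostingClassifier(),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    error_rate = 1.0 - model.score(X_test, y_test)         # misclassified fraction
    print(f"{name}: error rate = {error_rate:.1%}")
```

As in the STATISTICA example, the deciding metric is the error rate of each model on held-out testing data, not on the data the model was trained on.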
Open the data set provided with STATISTICA titled creditscoring.sta.
On the Home tab in the Output group, click Add to Workspace and select Add to New Workspace. In the title bar of the workspace, verify that Beta Procedures is selected.
As new nodes are created for algorithms and fully tested, they are made available in the All Validated Procedures selection. Boosted Trees Classification is currently available using this option. Logistic Regression is still in the testing process and is therefore only available within the Beta Procedures area.
Within the data set, there is a variable titled TrainTest that separates the data into a training data set and a testing data set. To split the data into these two groups, do the following:
On the Data tab in the Manage group, click Subset twice to add two subset nodes to the workspace. Verify that the subset nodes are connected to the data node. A helpful practice for keeping track of your analyses is to rename nodes according to their selection criteria. Edit the names of the nodes (right-click the name and select Rename) to represent the training and testing data as illustrated below.
To edit the parameters of a node, you can either click the gray gear icon at the upper-left corner of the node or double-click the node. In the Include Cases group box, select the Specific, selected by option button. Enter the expression as shown in the next illustration.
Complete the same procedure for the subset node that represents the testing data.
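Conceptually, the two subset nodes are simple row filters on the TrainTest column. A minimal pandas sketch of the same split (the data frame here is a small synthetic stand-in, not the actual creditscoring.sta contents):

```python
# Sketch of what the two subset nodes do: filter rows on the TrainTest
# column. The data frame is a synthetic stand-in for creditscoring.sta.
import pandas as pd

data = pd.DataFrame({
    "Credit": ["Good", "Bad", "Good", "Bad"],
    "TrainTest": ["Train", "Train", "Test", "Test"],
})

training = data[data["TrainTest"] == "Train"]   # Training subset node
testing = data[data["TrainTest"] == "Test"]     # Testing subset node

print(len(training), len(testing))  # → 2 2
```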
In the workspace illustration above, you can see that the Training subset node has been run since it no longer has a yellow background (run your Training node by clicking the green arrow icon at the lower-left corner of the node). Also, the document icon at the lower-right corner means that there is data available for downstream analysis. Clicking on that document icon will open the available data, and when you scroll to the right of the data file, you can verify that only those cases with TrainTest = “Train” have been selected, indicating you have specified the correct inclusion criteria in the subset node.
Close the data set.
On the Data Mining tab in the Trees/Partitioning group, click Boosted Trees and select Boosted Classification Trees. On the Statistics tab, in the Advanced/Multivariate group, click Advanced Models > Generalized Linear/Nonlinear and select GLZ Custom Design (beta). Ensure that both nodes are connected to the Training node.
Edit the parameters of the Boosted Classification Trees analysis node and make the variable selections shown below.
In the Boosted Classification Trees dialog box, select the Code Generator tab. Verify that the only selection is for PMML.
Leave all other settings at their default values, and click the OK button.
Edit the parameters of the GLZ Custom Design (beta) node. On the Quick tab, select Logit model with a Binomial distribution using the Logit Link function.
On the Model Specification tab, make the same variable selections as indicated in the analysis node for Boosted Classification Trees, as well as only PMML selected on the Code Generator tab, and click OK.
To review the results of the analysis on the training data, you could double-click the Reporting Documents icon; for this example, however, the focus will be on the performance of these models on the testing data. Note that the PMML generated by the analyses was automatically loaded into the PMML Model nodes, which are downstream of the analysis nodes. Edit the parameters of the PMML Model node that is connected to the Boosted Classification Trees analysis node and select the PMML tab.
You can see that the PMML script that represents this Boosted Classification Trees model is included in this node. Close the Deployment using PMML dialog box.
Connect the Testing subset node to the Rapid Deployment node. The Rapid Deployment node takes the models to which it is connected and applies those models to data to which it is also connected. In this example, it will take the Boosted Classification Trees and Logistic Regression models and apply them to the Testing data.
Run the Testing subset node and verify that you have correctly selected only the Testing data.
Edit the parameters of the Rapid Deployment node. The output options of this node are worth exploring on your own; you will find a wide range of output available, from including predicted probabilities in the output to ROC curves.
For this example, we will leave all settings at their default values with the exception of the Lift chart settings. On the Lift chart tab, verify that the Lift chart (lift value) check box is selected, with bad as the Category of response.
Run the Rapid Deployment node, which deploys the Boosted Trees and Logistic Regression models onto the Test data. After the node is run, the workspace will appear as below.
To review the results of the Rapid Deployment node, you can either double-click the Reporting Documents nodes, or you can click the document icon at the upper-right corner of the Rapid Deployment node. For this example, review the results by clicking on the appropriate icon on the Rapid Deployment node; this will bring you immediately to the Rapid Deployment results. Select the table of results for Summary of Deployment (Error rates) (Testing).
From this table, we can see that the Boosted Trees model had an error rate of 30.5% and the Logistic Regression model had an error rate of 26.3%. This indicates that at the default settings for the algorithms, the Logistic Regression model performs better than the Boosted Trees model. In the results folder, select the lift chart.
From this chart, we can see that if we applied both models to all of the testing data and took the top 20 percent of cases with the highest predicted probability of the classification Bad, the Logistic Regression model would have a lift value of approximately 1.9 while the Boosted Trees model would have a lift value of approximately 1.7. This again confirms that, at the default settings, the Boosted Trees model is outperformed by the Logistic Regression model.
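The lift value can be computed directly from predicted probabilities: it is the rate of Bad cases among the top-scored fraction of the test set, divided by the overall Bad rate. A short sketch (scores and labels here are synthetic, for illustration only, not the actual model output):

```python
# Computing lift at the top 20% of cases ranked by predicted probability
# of "Bad". Scores and labels are synthetic, for illustration only.
import numpy as np

def lift_at(y_true, scores, fraction=0.2):
    """Lift = Bad rate in the top `fraction` of scores / overall Bad rate."""
    n_top = max(1, int(round(len(scores) * fraction)))
    top_idx = np.argsort(scores)[::-1][:n_top]   # highest predicted prob first
    return y_true[top_idx].mean() / y_true.mean()

rng = np.random.default_rng(1)
scores = rng.random(500)
# Make "Bad" more likely when the score is high, so lift exceeds 1.
y_true = (rng.random(500) < scores).astype(float)

print(round(lift_at(y_true, scores), 2))
```

A lift of 1.9 at the top 20 percent means that targeting only those cases finds Bad credit 1.9 times as often as selecting cases at random.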