Using Case States to Color or Mark by Category in a Graph
Written by: STATISTICA
In the 2011 Rexer Data Mining Survey, STATISTICA was named the primary data mining tool used by practitioners for the third year in a row. Of the 20 commercial categories included in the survey, STATISTICA took first place in 16. Many of these categories cover highly complex areas, such as the variety of available algorithms, the quality and accuracy of model performance, and the ability to modify algorithm options to fine-tune analyses.
STATISTICA also took first place in an area that is sometimes overlooked but equally important in data analysis: strong graphical visualization of models. A single still image can convey large amounts of data or very complex ideas. STATISTICA’s graphical tool set makes it easy to generate such images, and the ability to customize them is limited practically only by your imagination.
The box plot, more formally known as the box-and-whisker plot, is an exploratory graphic that shows the distribution of a data set at a glance. Any type of data you might use in a histogram can also be used in a box plot. By default, however, box plots do not show raw data. Scatterplots, which do show raw data, offer a Mark Selected Subsets feature that distinguishes individual data points by group, but Mark Selected Subsets is not included as an option for box plots. One way to represent individual data points by group in a box plot is to use case states to color or mark by category.
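As a rough illustration of what a box plot summarizes, the following Python sketch computes a five-number summary. It is not STATISTICA code. The function name box_plot_summary, the sample values, and the quartile convention (Python's "inclusive" method) are all assumptions made for this example, and a given box plot may define its box and whiskers differently.

```python
from statistics import quantiles


def box_plot_summary(values):
    """Return the five numbers a basic box-and-whisker plot can be drawn from."""
    ordered = sorted(values)
    # Quartiles using the "inclusive" convention; other tools may use another rule.
    q1, q2, q3 = quantiles(ordered, n=4, method="inclusive")
    return {
        "min": ordered[0],
        "q1": q1,
        "median": q2,
        "q3": q3,
        "max": ordered[-1],
    }


# Made-up values for illustration only (not taken from any real data set).
sample = [2, 4, 4, 5, 7, 9, 10, 12, 15]
summary = box_plot_summary(sample)
print(summary)
```

The raw values themselves do not appear in the summary, which is why a default box plot cannot show which individual cases fall where.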
STATISTICA includes an example data set named Rats.sta, which contains data obtained from rats running a maze in an experiment. The rats were raised in two environments, a Free environment and a Restricted environment. Within each environmental group, there were three strains of rats: Bright, Mixed, and Dull. These rats were used in a maze test, and the number of errors each rat made in completing the test was recorded and included in this data set. To see the distribution of errors made by environmental group, create a box plot with Errors as the dependent variable and Environment as the grouping variable.
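The idea of a dependent variable split by a grouping variable can be sketched in Python as follows. This is only an illustration of the grouping step, not how STATISTICA works internally. The helper median_by_group is hypothetical, and the rows are invented values that merely reuse the variable names Errors and Environment; they are not the contents of Rats.sta.

```python
from collections import defaultdict
from statistics import median


def median_by_group(rows, group_key, value_key):
    """Collect the dependent values per group, then summarize each group."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[group_key]].append(row[value_key])
    return {name: median(values) for name, values in groups.items()}


# Invented rows for illustration only; NOT the actual Rats.sta data.
rows = [
    {"Environment": "Free", "Errors": 3},
    {"Environment": "Free", "Errors": 5},
    {"Environment": "Free", "Errors": 4},
    {"Environment": "Restricted", "Errors": 8},
    {"Environment": "Restricted", "Errors": 6},
    {"Environment": "Restricted", "Errors": 10},
]
medians = median_by_group(rows, "Environment", "Errors")
print(medians)
```

Each group gets its own box in the plot, built from that group's values alone.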
This box plot shows that rats raised in the restricted environment made more errors than rats raised in the free environment. However, you cannot see the raw data in these graphs, or the different strains of rats in each group. To see this data, you can use case states to color or mark by category.
To proceed with customizing the box plot, you first need to use case states to assign icons to each separate strain of rat within the data set. With the data set active, click on the Data tab. In the Cases group, click Cases, and from the Case States submenu, select Color or Mark By Category. In the Color or Mark By Category dialog box, click the Change Variable button. In the variable selection dialog box, select the variable Strain, and click OK. Then, click OK in the Color or Mark By Category dialog box. The graphic below shows what the data set will look like once case states are used to color or mark by category.
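Conceptually, coloring or marking by category gives every distinct value of the chosen variable its own marker and color, and each case then inherits the mark of its category. The Python sketch below illustrates that idea only. The function assign_marks, the marker and color names, and the sample column are assumptions made for this example and do not reflect the icons STATISTICA actually assigns.

```python
def assign_marks(categories, marks):
    """Give each distinct category the next unused (marker, color) pair."""
    mapping = {}
    for category in categories:
        if category not in mapping:
            if len(mapping) >= len(marks):
                raise ValueError("not enough marks for the number of categories")
            mapping[category] = marks[len(mapping)]
    return mapping


# Hypothetical column of strain labels and hypothetical marks.
strain_column = ["Bright", "Mixed", "Dull", "Mixed", "Bright", "Dull"]
marks = [("circle", "red"), ("square", "blue"), ("triangle", "green")]

strain_marks = assign_marks(strain_column, marks)
# Every case (row) carries the mark of its category.
case_marks = [strain_marks[strain] for strain in strain_column]
print(strain_marks)
```

Because the mark is stored with the case rather than with a particular graph, any graph that displays raw data points can reuse it.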
This procedure has now assigned an individual color-coded icon to each category of rat included in the variable Strain. These icons can now be used to provide additional details in the box plot graph regarding the strain of rat. To specify the customized box plot, resume the analysis initially started for the box plot. In the initial box plot, the only specifications made were the variable selections of Errors as the dependent variable and Environment as the grouping variable.
In the 2D Box Plots dialog box, click the Advanced tab. To include raw data points in the graph, select the Display raw data check box. If you generate a graph with this set of specifications, you will see that the graph still is not as readable as it could be.
The icons for the different strains overlap one another, and they are not included in the legend. To keep the icons from overlapping, another option needs to be specified: Jitter. On the Advanced tab of the 2D Box Plots dialog box, shown below, you will see the Jitter option beneath the Display raw data check box. For this example, the Jitter options specified are Random and 50% width.
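In general, random jitter works by shifting each raw data point sideways by a small random amount so that points with similar values no longer sit on top of one another. The Python sketch below shows the general technique. It assumes that "50% width" means the points are spread across half the width of the box, which is an interpretation and not a documented description of STATISTICA's behavior. The function jitter_positions and its parameters are hypothetical.

```python
import random


def jitter_positions(center, count, box_width=1.0, fraction=0.5, seed=0):
    """Return `count` horizontal positions scattered randomly around `center`.

    The points are spread over `fraction` of `box_width`, so with the
    defaults each offset lies within +/- 0.25 of the center.
    """
    rng = random.Random(seed)  # fixed seed so the layout is repeatable
    half_span = box_width * fraction / 2
    return [center + rng.uniform(-half_span, half_span) for _ in range(count)]


# Ten points belonging to the box drawn at x = 1.
positions = jitter_positions(center=1.0, count=10)
print(positions)
```

Only the horizontal position changes. The vertical position, which carries the actual data value, is left untouched, so jitter improves readability without distorting the data.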
Click OK to produce the graph. In the following image, the legend has been slightly modified: the entries for extreme data have been removed, and the special characters representing the strains of rats have been added. You can make these changes by double-clicking the legend and typing, or copying and pasting, text into the Titles/Text dialog box.
You can now see not only the raw data in the box plot, but also whether each data point belongs to the Bright, Mixed, or Dull strain of rat. With this level of customization, you can see that both the Bright and Dull strains had error levels at both the high and low extremes of the distribution, in both the Free and Restricted environments. The Mixed strain, however, had error levels in the lower part of each group's distribution, with only one data point close to the median in the Restricted environment. Is there something about the Mixed strain that allows these rats to make fewer errors than either the Dull or Bright strains? This level of customization brings out this relationship and makes it easy to see for anyone who reviews these graphs.
Graphs in STATISTICA can be customized to display complex relationships among variables that may not be easily distinguishable in other formats.