Using Case States to Color or Mark by Category in a Graph

Written by: STATISTICA

In the 2011 Rexer Data Mining Survey, STATISTICA ranked first for the third year in a row as the primary data mining tool used by practitioners. Of the 20 commercial categories included in the survey, STATISTICA took first place in 16. Many of these categories cover highly complex areas, such as the variety of available algorithms, the quality and accuracy of model performance, and the ability to modify algorithm options to fine-tune analyses.
STATISTICA also took first place in an area that is sometimes overlooked but is equally important in data analysis: strong graphical visualization of models. A single still image can convey large amounts of data or very complex ideas. STATISTICA’s graphical tool set makes it easy to generate such images, and the options for customizing them are extensive.
The box plot, more formally known as the box-and-whisker plot, is an exploratory graphic that shows the distribution of a data set at a glance. Any type of data you might use in a histogram can also be used in a box plot. By default, box plots do not show raw data. Scatterplots, which do show raw data, offer a Mark Selected Subsets feature that distinguishes individual data points by group. However, Mark Selected Subsets is not available for box plots. One way to represent individual data points by group in a box plot is to use case states to color or mark by category.
Included in STATISTICA is an example data set named Rats.sta, which contains data obtained from rats running a maze in an experiment. The rats were raised in two environments, a Free environment and a Restricted environment. Within each environmental group, there were three strains of rats: Bright, Mixed, and Dull. These rats were used in a maze test, and the number of errors each rat made in completing the test was recorded and included in this data set. To see the distribution of errors made by environmental group, create a box plot with Errors as the dependent variable and Environment as the grouping variable.
StatSoft Box Plot of ERRORS grouped by ENVIRONMENT
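For readers who want to see what a grouped box plot summarizes, here is a minimal Python sketch that computes a five-number summary per group. It is not STATISTICA code. The records are made-up values, not the contents of Rats.sta, and the quartile method (the "inclusive" rule of Python's statistics module) is an assumption that may differ from the rule STATISTICA uses for its boxes and whiskers.

```python
from collections import defaultdict
from statistics import quantiles

# Made-up illustrative values; these are NOT the contents of Rats.sta.
records = [
    ("Free", 10), ("Free", 20), ("Free", 30),
    ("Free", 40), ("Free", 50), ("Free", 60),
    ("Restricted", 30), ("Restricted", 40), ("Restricted", 50),
    ("Restricted", 60), ("Restricted", 70), ("Restricted", 80),
]


def box_summary(values):
    """Return the five numbers a basic box plot is drawn from."""
    # Assumption: "inclusive" quartiles; other tools may interpolate differently.
    q1, med, q3 = quantiles(values, n=4, method="inclusive")
    return {"min": min(values), "q1": q1, "median": med, "q3": q3, "max": max(values)}


def summarize_by_group(rows):
    """Group (group, value) pairs and summarize each group separately."""
    groups = defaultdict(list)
    for group, value in rows:
        groups[group].append(value)
    return {group: box_summary(vals) for group, vals in groups.items()}


summary = summarize_by_group(records)
for group, stats in summary.items():
    print(group, stats)
```

Here the group label plays the role of the grouping variable (Environment) and the numeric value plays the role of the dependent variable (Errors).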
The box plot of ERRORS by ENVIRONMENT shows that the rats raised in the restricted environment made more errors than the rats raised in the free environment. However, the graph shows neither the raw data nor the strain of each rat within a group. To see these details, you can use case states to color or mark by category.
To proceed with customizing the box plot, you first need to use case states to assign icons to each separate strain of rat within the data set. With the data set active, click on the Data tab. In the Cases group, click Cases, and from the Case States submenu, select Color or Mark By Category. In the Color or Mark By Category dialog box, click the Change Variable button. In the variable selection dialog box, select the variable Strain, and click OK. Then, click OK in the Color or Mark By Category dialog box. The graphic below shows what the data set will look like once case states are used to color or mark by category.
STATISTICA data set using case states to color or mark by category
This procedure has now assigned an individual color-coded icon to each category of rat included in the variable Strain. These icons can now be used to provide additional details in the box plot graph regarding the strain of rat. To specify the customized box plot, resume the analysis initially started for the box plot. In the initial box plot, the only specifications made were the variable selections of Errors as the dependent variable and Environment as the grouping variable.
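The idea behind Color or Mark By Category, one marker and one color per distinct value of a variable, can be imitated outside STATISTICA. The Python sketch below is only an analogy: the marker codes, color names, and the function assign_case_states are hypothetical and do not correspond to the icons STATISTICA chooses or to any STATISTICA API.

```python
# Hypothetical marker and color choices; STATISTICA selects its own icons.
MARKERS = ["o", "s", "^", "D", "v"]
COLORS = ["red", "blue", "green", "orange", "purple"]


def assign_case_states(categories):
    """Map each distinct category, in first-seen order, to a (marker, color) pair."""
    states = {}
    for category in categories:
        if category not in states:
            index = len(states)
            # Cycle through the lists if there are more categories than styles.
            states[category] = (MARKERS[index % len(MARKERS)],
                                COLORS[index % len(COLORS)])
    return states


# One entry per case (row); made-up ordering, not the rows of Rats.sta.
strain = ["Bright", "Mixed", "Dull", "Bright", "Dull", "Mixed"]

states = assign_case_states(strain)
per_case = [states[s] for s in strain]  # style attached to every individual case

for category, (marker, color) in states.items():
    print(f"{category}: marker={marker}, color={color}")
```

Because the style is stored per case rather than per graph, any plot that draws individual points can reuse it, which is what makes the approach useful for box plots with raw data displayed.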
In the 2D Box Plots dialog box, click the Advanced tab. To include raw data points in the graph, select the Display raw data check box. If you generate a graph with this set of specifications, you will see that the graph still is not as readable as it could be.
STATISTICA Box Plot of ERRORS grouped by ENVIRONMENT
The icons for the different strains overlap one another, and they are not included in the legend. To keep the icons from overlapping, specify one more option: Jitter. On the Advanced tab of the 2D Box Plots dialog box, shown below, the Jitter option is located beneath the Display raw data check box. For this example, set Jitter to Random with 50% width.
STATISTICA 2D Box Plots dialog box
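Random jitter spreads points that share the same category position by adding a small random horizontal offset to each one. The Python sketch below illustrates the general technique, not STATISTICA's implementation. It assumes that "50% width" means the points are spread over half the width of the box, and the function name jitter_positions and its parameters are hypothetical.

```python
import random


def jitter_positions(n_points, center, box_width, fraction=0.5, seed=None):
    """Return n_points x-positions scattered uniformly around center.

    Assumption: the spread covers `fraction` of the box width in total,
    so each point lands within +/- (box_width * fraction / 2) of center.
    """
    rng = random.Random(seed)  # seeded generator for reproducible jitter
    half_span = box_width * fraction / 2
    return [center + rng.uniform(-half_span, half_span) for _ in range(n_points)]


# Eight raw data points belonging to the box drawn at x = 1.0, box width 0.6.
xs = jitter_positions(8, center=1.0, box_width=0.6, fraction=0.5, seed=0)
print(xs)
```

Only the horizontal position changes; the vertical position of each point, which carries the data value, is left untouched.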
Click OK to produce the graph. In the following image, the legend has been modified slightly: the entries for extreme data were removed, and the special characters representing each strain of rat were added. To make these changes, double-click the legend and type, or copy and paste, text in the Titles/Text dialog box.
STATISTICA Box Plot of ERRORS grouped by ENVIRONMENT
You can now see not only the raw data in the box plot, but also whether each data point belongs to the Bright, Mixed, or Dull strain of rat. In both the Free and Restricted environments, the Bright and Dull strains have error counts at both the high and low extremes of the distribution. The Mixed strain, however, has error counts in the lower part of each group's distribution, with only one data point close to the median in the Restricted environment. Is there something about the Mixed strain that leads these rats to make fewer errors than either the Dull or Bright strains? This level of customization brings out this relationship and makes it easy to see for anyone who reviews these graphs.
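A visual impression like this one can be cross-checked by counting, for each strain, how many points fall above the median of their environment group. The Python sketch below shows one way to do that. The records are made-up values, not the contents of Rats.sta, so the printed counts say nothing about the real experiment, and the function tally_by_median is a hypothetical helper.

```python
from collections import Counter, defaultdict
from statistics import median

# Made-up illustrative (environment, strain, errors) records; NOT Rats.sta.
records = [
    ("Free", "Bright", 10), ("Free", "Mixed", 20), ("Free", "Dull", 30),
    ("Free", "Bright", 40), ("Free", "Mixed", 25), ("Free", "Dull", 60),
    ("Restricted", "Bright", 35), ("Restricted", "Mixed", 30), ("Restricted", "Dull", 70),
    ("Restricted", "Bright", 80), ("Restricted", "Mixed", 45), ("Restricted", "Dull", 50),
]


def tally_by_median(rows):
    """Count points per (environment, strain) above vs. at or below the group median."""
    errors_by_env = defaultdict(list)
    for env, _strain, errors in rows:
        errors_by_env[env].append(errors)
    medians = {env: median(values) for env, values in errors_by_env.items()}

    counts = Counter()
    for env, strain, errors in rows:
        side = "above" if errors > medians[env] else "at_or_below"
        counts[(env, strain, side)] += 1
    return medians, counts


medians, counts = tally_by_median(records)
for key in sorted(counts):
    print(key, counts[key])
```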
Graphs in STATISTICA can be customized to display complex relationships among variables that may not be easily distinguishable in other formats.
About statsoftsa

StatSoft, Inc. was founded in 1984 and is now one of the largest providers of analytic software worldwide. StatSoft is also the largest manufacturer of enterprise-wide quality control and improvement software systems in the world, and the only company capable of supporting its QC products worldwide, with wholly owned subsidiaries in all major markets. StatSoft has 23 full-service offices on all continents, and its software is available in more than 10 languages.

Posted on January 16, 2013, in Uncategorized.
