STATISTICA Knowledge Base – Data Manipulation
Splitting and Merging Files
How do I split a spreadsheet into smaller files?
On the Data menu, there are two commands for splitting a spreadsheet: Subset and Random Sampling.
Subset creates a new spreadsheet based on specified variables and case selection conditions.
Random Sampling creates a new spreadsheet based on a simple random or stratified random sampling.
Select either Save or Save As from the File menu to save the new spreadsheet. The original spreadsheet is not in danger of being overwritten unless you use the same name.
Can I select random subsets of data?
Yes, you can. Select Random Sampling from the Data menu to display the Create a Random Sample dialog. Sampling types include simple, systematic, split node, and stratified. Various options are available for each sampling type.
How do I merge two STATISTICA Spreadsheets?
Select Merge from the Data menu to display the Merge Options dialog.
Either variables (columns of data) or cases (rows of data) can be merged. In other words, the second file can be appended to either the “right side” (as variables) or “bottom” (as cases) of the first file. STATISTICA provides additional options for different merge types.
Can I merge the text labels/numeric values from two files?
Yes. Select Merge from the Data menu to display the Merge Options dialog, and select the Text Labels tab.
On this tab, you can select the way in which the text values from the two files are to be merged.
How do I merge two data sets that contain the same variables, but in different order?
Select Merge from the Data menu to display the Merge Options dialog, and select the Cases tab. After selecting the two data files in the lower section of the dialog, select the Match by variable name check box in the Mode box, and then click the OK button to execute the merge.
Can I merge the variables of one data set with another and match cases by multiple unique identifier variables?
Yes, you can select multiple variables by which to match when using the Match variables mode (accessible in the Merge Options dialog – Variables tab, Mode group box). The variables between the two data sets will be paired together in the (column) order that they appear in the data set.
How do I apply formulas to variables?
Variables in STATISTICA Spreadsheets can be defined by formulas that support a wide selection of mathematical, logical, and statistical functions.
When you enter a long variable name (in the Long name field on the Variable dialog) that starts with an equal sign, STATISTICA will assume that it is a formula and will verify it for formal correctness. If the formula is formally correct, you will then be given the choice to recalculate the variable now or later. You can use variable formulas to verify data, transform a variable, recode a variable or create values of the variable based on logical conditions (e.g., =(v0<=100)*1 + (v0>100)*2 will assign a value of 1 to cases number 1 through 100 and 2 to cases above 100). Refer to variables by their names (e.g., Test1, Income) or numbers (e.g., v1, v2, v3, …); v0 is the case number. A comment may be added to a formula following a semicolon.
What is the Function Browser?
Access the Function Browser by clicking the Functions button in the Variable dialog. The Function Browser helps you quickly access different functions and their syntax and descriptions. Functions are grouped by Category for easier searching.
Can STATISTICA automatically recalculate all spreadsheet formulas when the data change?
Yes. Select the Auto-recalculate when the data change check box in the Recalculate Spreadsheet Formulas dialog (accessed by clicking the Recalculate button on the toolbar, or selecting Recalculate Spreadsheet Formulas from the Data menu, or pressing Shift+F9).
STATISTICA will automatically recalculate all spreadsheet formulas when data are changed in the spreadsheet. Alternatively, you can press F9 to manually recalculate the formulas in the spreadsheet.
Recoding, Verifying, and Cleaning Data
How do I create values of a new variable based on conditions met by other variables?
You can use either of the data transformation facilities: spreadsheet formulas or STATISTICA Visual Basic. However, often the quickest way to create values of a new variable based on conditions met by other variables is to use the data recoding facility, Recode Values of Variable dialog, which is accessible at any point from the spreadsheet by clicking the Vars button and selecting Recode or selecting Recode from the Data menu. Note that the currently selected variable does not have to be included in the text of the recoding conditions. Thus, you can use this facility to create values of a variable based on conditions met by other variables.
For example, you can add a new (empty) variable to the data file, and then use this facility to create the new values. For instance, the recoding conditions could be used to assign 1s to the new variable for all “male subjects 18 to 25 years old with cholesterol levels below 200,” 2s to “male subjects 18 to 25 years old with cholesterol levels above 200,” and assign the missing data value to all other subjects.
What is the simplest way to recode values of a variable (e.g., split a continuous variable into categories)?
Spreadsheets have a quick recoding facility. Select Recode from the spreadsheet Vars toolbar button to display the Recode Values of Variable dialog. You can scroll down in this dialog to define new values (text or numeric) of the current variable (see the groups New Value 1, New Value 2, New Value 3, etc., below) depending on the specific conditions that you define (see the Category 1, Category 2, Category 3, etc., groups below).
For example, the recoding conditions specified above would “translate” the values less than or equal to 3 of the current variable (Quality) into Low, the values greater than 3 and less than or equal to 6 into Medium, and the values greater than 6 and less than or equal to 10 into High. Note that any other values that occur in the Quality variable would be “translated” into the missing data code because the MD Code option button is selected in the Other group box. Note also that STATISTICA first finds all cases that meet Category 1 and recodes them; then STATISTICA only searches through the remaining cases that have not been recoded to see if they meet Category 2, etc. Hence, you do not have to specify Quality > 3 AND Quality < = 6 in Category 2.
When specifying conditions, follow the standard syntax conventions common in STATISTICA to all those procedures that involve any operation of “selecting cases” based on their values.
How can I verify and “clean” data?
Select Verify Data from the Data – Verify Data submenu to access an interactive data-verification and cleaning facility. Use the options in the Verify Data dialog to enter the conditions to be met by the data.
Follow the standard syntax conventions common in STATISTICA to all those procedures that involve any operation of selecting cases based on their values. You can also save the current verification condition to a text file or open a file with previously saved conditions.
The verification can be as simple as checking whether values in a variable are “legal” (e.g., only 1 and 2 might be allowed for Gender) or whether they fall within allowed ranges of values (e.g., Age must be more than 0 and less than 100). It can also be as complex as checking multiple logical conditions that some values must meet in relation to other values.
Consider this example of conditional verification: If a person is a male or less than 10 years old, then the number of pregnancies for that person cannot be more than zero. In order to apply these conditions, you would specify (for example):
Invalid if: (v1=’MALE’ or AGE<10) and PREGN>0
Once you have entered your verification condition(s), click either the Find First button to select the first invalid case in your data file (after this first case has been selected, you can find the next case by selecting Find Next Invalid Case from the Data – Verify Data submenu) or click the Mark All button to mark all of the invalid cases in the data file according to the Marked Cells spreadsheet layout.
Can I sort a data set using a large number of sorting keys?
Yes. Select Sort from the Data menu to display the Sort Options dialog, which contains options for sorting variables. You can create up to 14 sorting keys in this dialog
How do I rank-order values of a variable (replace values with their ranks)?
Select Rank from the Vars button menu on the spreadsheet toolbar or from the Data menu to display the Rank Order Values dialog.
Then click the Variables button to select the variables to be ranked. Note that this operation replaces the original values with their respective ranks based on the sorted series. Optionally, you can also specify a subset of cases to be affected by the operation (click the Cases button), use case weights (click the Weight button), and set a number of options to perform specific types of ranking.
How do I transpose data (convert cases into variables)?
From the Data – Transpose submenu, select either Block or File.
Transposing a block of data. The block transposing command affects only the contents of cells in the block currently selected in the spreadsheet (the block must be square); the variable names and case names will not be affected. For example, the Transpose – Block command executed on the following square block of data:
produces the following result:
Transposing a data file. Data file transposing restructures the entire file. For example, transposing the data file shown above (before the block was transposed), changes the number of cases and variables in the data file:
Note that when you transpose a file, case names become variable names and variable names become case names.
How do I automatically fill ranges of data in the spreadsheet?
STATISTICA has a variety of operations that extrapolate (AutoFill) ranges.
- Random value fill. Select a block, right-click, and select Fill/Standardize Block – Fill Random Values from the spreadsheet shortcut menu. The currently selected block will be filled with random values (following a uniform distribution) in the range of 0 to 1. An error message is displayed if a block is not selected.
- Extrapolation of values (AutoFill). A selected series of values (consisting of at least two values) will be extended if a meaningful pattern is detected. For example, the following numbers: 1.00, 2.00, and 3.00 will be extended to 4.00, 5.00, 6.00, etc. Select the block of values to be extended, click on the small black handle in the lower-right corner, and drag through the next empty cells to extrapolate the values. If no meaningful pattern can be detected, predicted values from a linear trend regression will be used for the extrapolation.
- Fill down or right. Select a block of cells containing entries. Using the small black handle on the lower-right corner of the cells, drag either to the right or down to select empty cells. Then select either Fill/Copy Down or Fill/Copy Right from the Edit – Fill/Standardize Block submenu (or the spreadsheet shortcut menu). Both options work in a manner similar to Microsoft Excel’s Fill Right and Fill Down facility (also available in Excel from the Edit menu).
How does Extrapolate (AutoFill) work?
A horizontal or vertical series in a block can be extrapolated by dragging the block’s Fill Handle (a small solid black box located on the lower-right corner of the block). After selecting the block, point at the Fill Handle. When the mouse pointer changes to a plus sign, drag to the last cell(s) to which you want to extrapolate the data. STATISTICA can create series of values such as sequential numbers, linear extrapolations, and dates (e.g., you can extend a series such as 1, 2, 3 to include 4, 5, 6).
You can extrapolate (AutoFill) a block in the following ways:
- If the initial selection contains repeated values, these values are duplicated in the extended block.
- Non-identical values are extended by linear trend regression to compute extrapolated values for the series.
- If the block contains the same text labels (i.e., all the cells have the text label ‘Gender’), this text label is copied into the extended block. If the block contains several different text labels, then the text labels (beginning with the numeric value of the last text label in the block) are extended by linear regression to compute extrapolated values for the series.
- If a variable in the block contains dates (i.e., Date is selected in the Display format group box in the Variable specifications dialog), then based on the pattern of dates, STATISTICA extrapolates the dates appropriately.
- If a variable in the block contains text names of months (e.g., Jan, Feb, Mar,), days (e.g., Mon, Tue, Wed,), or quarters (e.g., Q1, Q2,), then STATISTICA extrapolates either up, down, or to the left or right of the block the rest of the names in the series (e.g., Apr, May, …, Dec; Thr, Fri, Sat, Sun; or Q3, Q4, respectively). Note that it does not matter whether upper case or lower case letters are used, however they need to be consistent throughout the series; e.g., use Jan, Feb, Mar, …; not Jan, FEB, mar, ….
- You can also create your own custom list of values to use in extrapolating values. Select Custom Lists in the left pane of the Options dialog (accessible by selecting Options from the Tools menu.
How do I standardize values in a block?
Select a block of values in the spreadsheet, right-click, and select either Fill/Standardize Block – Standardize Columns or Fill/Standardize Block – Standardize Rows from the shortcut menu. The standardized values are computed as follows:
Standard Value = (raw value – mean of highlighted row/column)/standard deviation.
You can also standardize selected variables by selecting Standardize from the Data menu to display the Standardization of Values dialog. This procedure operates independent of the currently selected block, but takes into account the current case selection conditions and weights.
How do I select variables for an analysis?
Every analysis definition dialog in STATISTICA contains at least one Variables button that is used to specify variables to be analyzed. Click the button (or press v) to display the variable selection dialog. If you forget to specify variables and click the OK button to start the analysis, STATISTICA prompts you to specify the variables to be analyzed. The variable selection dialog supports various ways of selecting variables, and it offers various shortcuts and options to review the contents of the data file.
Can variables be selected for analyses by highlighting them in the spreadsheet?
Yes. If you select a block in the spreadsheet, the variables included in the block are automatically preselected for the next analysis. This shortcut is designed to limit the chance of producing unintended results in the following ways:
- The preselection of variables by marking a block in the spreadsheet works only as long as you have not selected a specific list of variables for the analysis (i.e., it never overwrites your previous choices; STATISTICA remembers your previous choices).
- If the variables in the block are not what you intend to analyze, you do not need to “undo” the selection. When you display the variable selection dialog, the list of preselected variables is selected, but the first click of the mouse clears the previous range (unless you keep the CTRL key pressed). If you prefer to use the keyboard to specify the list, then the first (non-cursor moving) key you press deletes the previous entry in the variable selection edit field.
How do I select a subset of cases (observations) to be included in an analysis?
Before an analysis begins (i.e., before the data are processed), you can instruct STATISTICA to select only cases (i.e., rows in the spreadsheet) that meet specific selection criteria. A facility to define and manage case selection conditions can be accessed by selecting Edit from the Tools – Selection Conditions menu or by clicking on the status bar field Sel, which shows the current status of the Case Selection Conditions. Note that when case selection conditions are disabled, this field reads Sel:OFF, when they are enabled, it reads Sel:ON).
Note that there are two types of case selection conditions in STATISTICA – permanent (saved with the data file) and temporary (associated with the current analysis only).
How are Case Selection Conditions stored and saved?
Global case selection conditions (definitions of subsets of data) are attached to a spreadsheet and are called Spreadsheet Case Selection Conditions. They are accessed from the Tools – Selection Conditions menu or by clicking the status bar field Sel. Analysis/Graph Case Selection Conditions can be locally connected to an analysis or graph. They are accessed by clicking the Select Cases button that is available on all analysis/graph definition dialogs. You can save and open either type of selection conditions (and maintain libraries of case selection conditions) using the Open or Save As buttons in the appropriate Case Selection Conditions dialog. Note that case weights are specified in a similar manner.
What is the quickest way to review basic descriptive statistics for a variable?
You can either select Descriptive statistics from the Basic Statistics and Tables Startup Panel (accessible from the Statistics menu) or click the Values/Stats button in the Variable specifications dialog. To display the Variable specifications dialog, double-click on the variable name in the spreadsheet. Alternatively, right-click on the variable name and select Variable Specs from the shortcut menu.
When you click the Values/Stats button, the Values/Stats dialog displays information about the selected variable, a sorted list of its values, and descriptive statistics. The descriptive statistics can be copied to the Clipboard by clicking the Copy button (located in the lower-right corner of the dialog).
What is the quickest way to review a sorted list of all unique values of a variable?
Click the Values/Stats button in the Variable specifications dialog to display a sorted list of all unique numeric values and corresponding text labels (if there are any) for that variable. The Variable specifications dialog is accessed by double-clicking the variable name in the spreadsheet.
Note that the assignments between the numeric values and text labels for the variable can be edited in the Text Labels Editor.
What are statistics of block data?
Statistics for each row or column in a selected block can be computed and added to the spreadsheet (i.e., appended at the end) by selecting the desired Statistics of Block Data from the shortcut menu.
For example, the selection shown above computes the sums for each row in the selected block. The selected statistic (Sum) is displayed in a new spreadsheet.
If you select Block Columns, the results of the selected statistic are appended as a new case, and the case name contains the name of the selected statistic and the range description, as shown below.
In addition, a (customizable) list of predefined graphs is also available from this shortcut menu.
These Graphs of Block Data enable you to visualize the data in the selected block either by row or by column.
Can matrix data be used for input instead of raw data?
Yes. STATISTICA supports a variety of matrix file types as input data (e.g., for Multiple Regression, Factor Analysis, Reliability/Item Analysis, and other modules). Matrix files can be edited in the spreadsheet (as if they were raw data files); however, in order to be properly interpreted as matrices, they need to meet specific content and format conditions (depending on the matrix type). STATISTICA also supports the multiple matrix data file format for input in procedures (such as Structural Equation Modeling) that support matrix data for groups. Note that matrix data that meet the specific content and format conditions are automatically saved in STATISTICA Matrix Spreadsheet (*.smx) format.
How do I take a subset of cases defined by a case selection condition and create graphs where the Case Selection Condition is explicitly stated in the title of the graphs?
Select Cases buttons are available on all analysis and graph specification dialogs. Click this button to display the Analysis/Graph Case Selection Conditions dialog, which contains options to create conditions for which cases will be included (or excluded) in the current analysis. By default, whenever case selection conditions are used, a subtitle is appended to the graph that displays the selection condition. Options controlling the display of selection conditions in graphs are located in the Analyses/Graphs: Display options pane of the Options dialog, accessible by selecting Options from the Tools menu.