CRISP: Data Mining Session
Hello fellow statistical newbs, as well as the better versed. In the last entry we reviewed “Learning Data Mining: Session 1.” In that session, Jennifer taught us that data mining projects can be either supervised learning, where a specific target variable is used (as in classification or regression projects), or unsupervised learning, where there is no target variable. Clustering is the usual example of unsupervised learning, though it is not the only one. Overall, we learned that data mining simply helps find meaning and value within heaps of information. The series will continue to describe the entire data mining process.
In the second video of the session, Jennifer discusses the standard data mining process known as CRISP. CRISP stands for Cross-Industry Standard Process for Data Mining (often written CRISP-DM). Until CRISP, there was no standardized process, leaving people to come up with their own. “The pioneers of the field collaborated to make a standardized process. CRISP is applicable in any industry, using any data mining software” (Jennifer Thompson). The CRISP process has helped to make data mining projects faster, more efficient, and more cost-effective.
STATISTICA Data Miner offers three tools: Data Miner Recipes, Data Miner Workspace, and Interactive Data Miner. Data Miner Recipes lays out the steps of a data mining project using CRISP. Data Miner Workspace provides a structured workflow for data mining projects using CRISP. Interactive Data Miner is a dialog-driven approach to CRISP.
Next, the video goes into the process of using CRISP.
The first step, business understanding, is critical. During this step, the goals of the project are defined. One needs to determine what can be learned from the data. What questions can be answered? What business objectives can be met? It is important to define a clear plan in order to gauge the success of the project.
The next step is data understanding – accessing, collecting, and exploring data. This step requires professionals in the field. With a clear understanding of the business goals, the data are explored. The key here is to find relationships in data that trigger business understanding. Goals and hypotheses for the project are defined by looking at the interrelationship between business and data understanding.
Data preparation is the next and most time-consuming part of the process, sometimes taking up to 80% of the project's effort. Data preparation includes cleaning the data and removing unnecessary data, for example by re-coding outliers and handling missing values.
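For the fellow newbs who like to see things in code, here is a minimal Python sketch of two of those cleaning tasks: filling in missing values and re-coding outliers. It is not taken from the video or from STATISTICA; the function name `prepare`, the made-up `ages` column, and the choices of median fill and a fixed valid range are all assumptions for illustration.

```python
from statistics import median


def prepare(values, lower, upper):
    """Fill missing entries (None) with the median of the observed values,
    then re-code outliers by clipping them to the range [lower, upper]."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    filled = [fill if v is None else v for v in values]
    return [min(max(v, lower), upper) for v in filled]


# Hypothetical column: one missing entry and one impossible age.
ages = [34, None, 29, 410, 41]
cleaned = prepare(ages, lower=0, upper=100)
print(cleaned)  # [34, 37.5, 29, 100, 41]
```

Whether the median is the right fill value, or clipping the right treatment for an outlier, depends on the data and the business question, which is why the earlier understanding steps come first.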
After the data have been collected, explored, and cleaned, they must be modeled. There are many ways to model your data, which will be discussed in future sessions. Once the model(s) have been created, they must then be reviewed. Evaluation of the models is necessary to determine which best reflect the business goals of the project. In this phase, we determine how to use our models.
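To make the evaluation idea concrete, here is a small Python sketch that compares candidate models on data held back from training and keeps the one with the lowest error. Everything in it is assumed for illustration: the two toy models, the `holdout` rows, and the use of mean absolute error as the yardstick. A real project would pick a measure that reflects its own business goals.

```python
def mean_absolute_error(model, rows):
    """Average absolute gap between a model's predictions and the known answers."""
    return sum(abs(model(x) - y) for x, y in rows) / len(rows)


def select_best(candidates, rows):
    """Score every candidate on the same rows and return the best name plus all scores."""
    scores = {name: mean_absolute_error(m, rows) for name, m in candidates.items()}
    best = min(scores, key=scores.get)
    return best, scores


# Hypothetical candidates: a constant guess and a simple straight line.
candidates = {
    "baseline": lambda x: 10.0,
    "linear": lambda x: 2.0 * x + 1.0,
}

# Hypothetical held-back (input, actual outcome) pairs.
holdout = [(1.0, 3.2), (2.0, 4.9), (3.0, 7.1), (4.0, 8.8)]

best, scores = select_best(candidates, holdout)
print(best)  # linear
```

The winning model here is simply the one with the smallest error on the held-back rows; the same pattern works with any other score.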
By the deployment phase, we should have a model that best meets our business objective. Deployment uses the model to score new data and make final predictions. The next session will focus on examples of how one can use CRISP.