Monthly Archives: June 2013
Hello fellow statistical newbs, as well as the better versed. In the last entry we reviewed “Learning Data Mining: Session 1.” In that session, Jennifer taught us that data mining projects can involve either supervised learning, where a specific target variable is used (as in classification or regression projects), or unsupervised learning, such as clustering (though unsupervised learning encompasses more than just clustering). Overall, we learned that data mining simply helps find meaning and value within heaps of information. The series will continue to describe the entire data mining process.
In this second video in the session, Jennifer discusses the standard data mining process known as CRISP, short for the Cross Industry Standard Process for data mining. Until CRISP, there was no standardized process, leaving people to come up with their own. “The pioneers of the field collaborated to make a standardized process. CRISP is applicable in any industry, using any data mining software” (Jennifer Thompson). The CRISP process has helped make data mining projects faster, more efficient, and more cost-effective.
STATISTICA Data Miner offers three tools that follow the CRISP process: Data Miner Recipes, Data Miner Workspace, and Interactive Data Miner. Data Miner Recipes lays out the steps of a data mining project using CRISP. Data Miner Workspace provides a structured workflow for data mining projects using CRISP. Interactive Data Miner is a dialog-driven approach to CRISP.
Next, the video goes into the process of using CRISP.
The first step, business understanding, is critical. During this step, the goals of the project are defined. One needs to determine what can be learned from the data. What questions can be answered? What business objectives can be met? It is important to define a clear plan in order to gauge the success of the project.
The next step is data understanding – accessing, collecting, and exploring data. This step requires professionals in the field. With a clear understanding of the business goals, the data are explored. The key here is to find relationships in data that trigger business understanding. Goals and hypotheses for the project are defined by looking at the interrelationship between business and data understanding.
Data preparation is the next and most time consuming part of the process – sometimes taking up to 80% of the project efforts. Data preparation includes cleaning the data or taking out unnecessary data (re-coding outliers, handling missing data).
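As a minimal illustration of those two cleaning steps, here is a sketch in plain Python with made-up values (outside of any particular tool); it re-codes an outlier by capping it and then imputes a missing value with the mean:

```python
# Illustrative data-preparation sketch; all values are made up.
raw = [12.0, 15.0, None, 14.0, 900.0]  # None = missing, 900.0 = outlier

# 1) Re-code the outlier: cap observed values at a chosen threshold.
CAP = 100.0
capped = [x if x is None else min(x, CAP) for x in raw]

# 2) Handle missing data: impute with the mean of the remaining values.
observed = [x for x in capped if x is not None]
mean = sum(observed) / len(observed)
cleaned = [mean if x is None else x for x in capped]
```

Capping first keeps the outlier from distorting the imputed mean; real projects would of course choose the threshold and imputation strategy to suit the data.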
After the data is collected, explored, and cleaned, it must be modeled. There are many ways to model your data, which will be discussed in future sessions. Once the model(s) have been created, they must be reviewed. Evaluation of the models is necessary to determine which best reflects the business goals of the project. In this phase, we determine how to use our models.
By the deployment phase, we should have a model that best meets our business objective. Deployment uses the model to score new data and make final predictions. The next session will focus on examples of how one can use CRISP.
For those working with general linear models in STATISTICA, the warning message suggesting that you rescale your variables or increase the sweep delta may look familiar.
A lot is going on in this message, but it often boils down to this: the units and scale of the data are getting in the way of a good model. The message tends to occur when one X variable has a wide range, and therefore a large variance to match, while another X variable has a much smaller scale and range with a small variance. The mismatch in scale gets in the way of the math behind the scenes and can produce unstable regression models.
When this message is displayed, my first consideration is a simple rescaling of the data. Say I have a variable on a scale of 100,000 to 5,000,000. Simply dividing this column by 1,000, or perhaps 10,000, may be all I need to avoid this message and get a good, stable regression model. Alternatively, measures on a scale of 0.00025 to 0.001, for example, could be multiplied by 100 to even out the scale. This article should help you get started in creating a variable and using spreadsheet functions.
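A quick sketch of that rescaling idea in Python (the column values are hypothetical; in STATISTICA you would do the same with a variable transformation or spreadsheet function):

```python
# Two hypothetical predictors on wildly different scales.
revenue = [100_000, 2_500_000, 5_000_000]  # range ~1e5 to 5e6
rate = [0.00025, 0.0004, 0.001]            # range ~2.5e-4 to 1e-3

# Shrink the large-scale variable and grow the small-scale one.
revenue_scaled = [x / 10_000 for x in revenue]  # now 10 to 500
rate_scaled = [x * 100 for x in rate]           # now 0.025 to 0.1
```

Rescaling a predictor by a constant only changes its estimated coefficient by the inverse factor; the model's fit and predictions are unchanged, which is why this is such a safe fix.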
Not sure which variable is the offender? Use descriptive statistics to compute the variance of each of your X predictor variables to find out. The check performed by GLM compares the ratio of the smallest variance to the largest. Sort the descriptive statistics output by the variance column and you can quickly see not only the highest- and lowest-variance variables, but also any others with a similar variance that might need scaling.
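That check can be sketched in a few lines of Python (the predictor names and values here are hypothetical; STATISTICA's descriptive statistics output gives you the same variances directly):

```python
# Sample variance of each predictor, sorted to expose scale problems.
def variance(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

predictors = {  # hypothetical columns
    "revenue": [100_000, 2_500_000, 5_000_000],
    "rate": [0.00025, 0.0004, 0.001],
    "headcount": [12, 45, 80],
}

# Sort names from smallest to largest variance.
by_var = sorted(predictors, key=lambda name: variance(predictors[name]))

# The ratio of smallest to largest variance is what the check examines;
# a tiny ratio means the scales are badly mismatched.
ratio = variance(predictors[by_var[0]]) / variance(predictors[by_var[-1]])
```

Here "rate" and "revenue" land at opposite ends of the sort, flagging them as the pair to rescale.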
Of course, you can go the other route the message suggests: increase the sweep delta. This changes the threshold for the check that tests for a variance problem in the set of X predictor variables. Making this change does not affect the computations that produce the regression equation; it only affects the threshold for the variance stability check.
DAAD-NRF IN-COUNTRY MASTERS AND DOCTORAL SCHOLARSHIP
CALL FOR APPLICATIONS FOR 2014
Deadline is 21 August 2013 for funding in 2014
The DAAD and the NRF are pleased to announce their Joint In-Country Scholarship Programme for Masters and PhD students studying at South African universities in 2014. The programme is based on a partnership between the DAAD and the NRF. The In-Country Scholarship Programme offers a maximum of 100 new scholarship awards for Masters and Doctoral candidates per annum.
The application deadline is 21 August 2013 for funding in 2014. Applicants must apply online at https://nrfsubmission.nrf.ac.za and follow the application procedure set out in the call document.
We would very much appreciate it if you could advertise the programme and disseminate the information within your institution.
Should you have any queries regarding the scholarship requirements or application procedure, please do not hesitate to contact us using the details below:
DAAD Information Centre Johannesburg
Ms Kerynn Dahl
NRF Scholarships and Fellowships
Ms Thashni Maistry
For technical problems:
The NRF Supportdesk: 08:00-13:00 and 13:30-16:30
The DAAD and the NRF look forward to receiving your applications.
Download DAAD-NRF Call 2014 [169.60 KB]
WebSTATISTICA is offered as a complete solution that includes the analytic functionality of the respective selected STATISTICA product or any combination of STATISTICA products.1
One of the clearest advantages offered by the WebSTATISTICA technology is that it makes the power of any of the STATISTICA family of products conveniently available anywhere, from any workstation equipped with an industry-standard Web browser. Thus, WebSTATISTICA adds a new dimension and an endless array of new possibilities and applications to the entire line of STATISTICA Data Analysis, Data Mining, Quality Control, and Six Sigma software.
WebSTATISTICA supports multiprocessor and load-balanced environments, making it suitable for internal cloud computing deployments.
Two Common Categories of Web-based Analytics
1) Custom Web-based applications
WebSTATISTICA supports one or more customized Web-based analytic applications to suit an organization’s specific needs. Users log in and see a highly targeted user interface customized for the particular application’s needs. Users have single-click access to the desired set of queries, analysis results, and reports, all displayed within their Web browser.
2) Interactive Statistical Application Deployed Enterprise-wide (across a Wide Area Network)
The full power of STATISTICA analytics1 is available via the server-based, Wide Area Network (WAN) architecture, which provides all of the advantages of server-side deployment: no client software to install, central configuration and ongoing management, increased scalability and performance, and a highly interactive user experience.
For example, the most recent data and reports (e.g., updated via queries to specific parts of the corporate data warehouse) – with options to interactively drill down into the results and obtain additional, specific insights about the business – can now be made available to authorized employees wherever they are, regardless of the type of computer to which they have access. Wherever there is Internet access (which means virtually everywhere), there is now also access to the query, reporting, and analytic tools of the most comprehensive data analysis system available.
Enterprise-wide Collaborative Web-based Products
WebSTATISTICA Server acts as the core of an enterprise-wide network system, allowing participants to work collaboratively and quickly share results (reports) as well as scripts of analyses or queries. Administrators can use user or group permissions to manage which groups of users have access to specific data or reports. The accessibility of its tools via the Internet makes WebSTATISTICA Server a perfect system for facilitating collaborative projects among employees working at different locations or branches of a corporation (even on different continents), or employees who are telecommuting or traveling.
WebSTATISTICA Knowledge Portal is a powerful, Web-based knowledge-sharing tool that allows your colleagues, employees, and/or customers (with appropriate permissions) to log in and quickly and efficiently access the information they need by reviewing predefined reports.
WebSTATISTICA Interactive Knowledge Portal offers portal visitors all the functionality of the Knowledge Portal plus additional options, including the ability to define and request new reports, run queries and custom analyses, drill down and up, slice and dice data, and gain insight from all resources made available to them by the portal designers or administrators.
STATISTICA Enterprise Web Viewer provides the ability to view analyses and reports that were generated within STATISTICA Enterprise or STATISTICA Enterprise / QC. This allows companies to protect their data and reports with the STATISTICA Enterprise security model.
1 99% of the functionality of STATISTICA is supported in WebSTATISTICA
STATISTICA Text Miner and WebCrawler are tools that make textual website analysis quick and easy. In this case study, Jennifer Thompson compares the content of several national news provider websites to determine what, if anything, distinguishes them from one another. Of course, she finds some interesting results…