Monthly Archives: May 2011

Business Intelligence – Solve a Critical Quality Problem

This is a continuation of Predictive Analytics – Solve a Critical Quality Problem. A biopharmaceutical manufacturing company was scrapping about 30% of its batches, which is very expensive. The company’s engineers tried to solve the problem with various techniques.

But it was not until they started using predictive analytics (also known as data mining) that they uncovered actionable process improvements. These improvements are predicted to lower the scrap rate from around 30% to 5%.

How were these improvements discovered?

The Data Mining Approach for Root Cause Analysis: Data mining is a broad term used in a variety of ways, alongside related terms such as “predictive modeling” or “advanced analytics.”

Here, it means the application of the latest data-driven analytics to build models of a phenomenon, such as a manufacturing process, based on historical data. In a nutshell, in the last 10-15 years, there has been a great leap forward in terms of the flexibility and ease of building models and the amount of data that can be utilized efficiently due to advances in computing hardware.

Data mining has changed the world of analytics

… in a good way.

… forever.

Companies that embrace these changes and learn to apply them will benefit.

Data mining begins with the definition and aggregation of the relevant data. In this case, it was the last 12 months of all the data from the manufacturing process, including:

  • raw materials characteristics
  • process parameters across the unit operation for each batch
  • product quality outcomes for the critical-to-quality responses used to judge whether to release or scrap each batch

Once the relevant data were gathered, StatSoft consultants sat down with the engineering team before the model-building process began. This is a critical step and one that you should consider as you adopt data mining.

We asked the engineers questions such as:

  • Which factors can you control, and which ones can you not control?
  • Which factors are easy to control, and which ones are difficult or expensive to control?

The rationale is that data mining is not an academic exercise when applied to manufacturing. It is being done to improve the process, and that requires action as the end result. A model that is accurate but based solely on parameters that are impossible or expensive to tweak is impractical (which is a nice way of saying useless).

Armed with this information, the next step in the data mining process is model building. In short, many data mining model types are applied to the data to determine which one yields the best goodness of fit, such as the smallest residuals between predicted and actual values.

Various methods are employed to ensure that the best models are selected. For example, a random hold-out sample of the historical data is set aside, and each model is used to make predictions for it. This protects against overfitting: a model that gets very good at predicting one set of historical data to the point where it is really bad at predicting the outcomes for other batches.
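For readers who want to see the mechanics, here is a minimal sketch of this hold-out comparison step in Python using the open-source scikit-learn library (not STATISTICA’s actual workflow). The file name and the “cqa” response column are invented for illustration.

    # Hypothetical sketch of hold-out model comparison (not StatSoft's actual code).
    # Assumes a pandas DataFrame `batches` with raw-material / process columns and
    # a critical-to-quality response column named "cqa" -- both names are made up.
    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LinearRegression
    from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
    from sklearn.metrics import mean_squared_error

    batches = pd.read_csv("batch_history.csv")   # 12 months of batch data (assumed file)
    X = batches.drop(columns=["cqa"])
    y = batches["cqa"]

    # Random hold-out sample: the models never see these batches during training.
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

    candidates = {
        "linear regression": LinearRegression(),
        "random forest": RandomForestRegressor(n_estimators=300, random_state=0),
        "boosted trees": GradientBoostingRegressor(random_state=0),
    }
    for name, model in candidates.items():
        model.fit(X_train, y_train)
        rmse = mean_squared_error(y_test, model.predict(X_test)) ** 0.5
        print(f"{name}: hold-out RMSE = {rmse:.3f}")   # pick the model with the smallest error

The model with the smallest hold-out error, not the smallest training error, is the one carried forward.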

A major advantage of data mining is that you don't need to make assumptions ahead of time about the nature of the data or the relationships between the predictors and the responses. Traditional least squares linear modeling, such as what is taught in Six Sigma classes on analytic tools, does require this knowledge.

For root cause analysis, most data mining techniques provide importance plots or similar ways to see very quickly which raw materials and process parameters are the major predictors of the outcomes and, just as valuable, which factors don't matter.
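Continuing the sketch above (same invented names), an importance plot can be produced directly from a fitted tree ensemble:

    # Hypothetical importance-plot sketch: rank predictors from a fitted tree ensemble.
    # Reuses `candidates` and `X_train` from the previous sketch (assumptions).
    import matplotlib.pyplot as plt

    rf = candidates["random forest"]             # tree ensembles expose relative importances
    importances = pd.Series(rf.feature_importances_, index=X_train.columns).sort_values()
    importances.plot(kind="barh")                # the least important factors sit at the bottom
    plt.xlabel("relative importance")
    plt.title("Which raw materials / process parameters drive the CQA?")
    plt.tight_layout()
    plt.show()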


At this point in the data mining process, StatSoft consultants sat down with the engineering team to review the most important parameters. Typically, there is an active discussion with comments from the engineers such as:

  • that can't be
  • I don't see how that parameter would be relevant

The conversation gradually transforms over the course of an hour to:

  • I could see how those parameters could interact with the ones later in the process to impact the product quality

Data mining methods are really good at modeling large amounts of data from lots of parameters, a typical situation in manufacturing. Humans are good at thinking about a few factors at a time and interpreting a limited time window of data.

As shown above, the two approaches complement each other, with the results from data mining providing important insights about the manufacturing process that can then be evaluated, validated, and utilized by the engineering team to determine:

  • Now, what do we do to improve the process? What are the priorities?

The company then planned to implement process improvements that are predicted to lower the scrap rate of batches from ~30% to ~5%!

Note: To get from root cause analysis to process improvements, the models were used for optimization (another data mining technique).


StatSoft Southern Africa continues to build relationships where it matters most — in the classroom.

We are committed to delivering software solutions and services for academia that will spark innovation and expand educational opportunities around the world.

STATISTICA is used by academia around the world.


 STATISTICA for Professors and Students

Individual Professors and Students

Academic solutions for professors and students are designed for individual use and are best suited to the desktop version of STATISTICA. To learn more about the desktop version of STATISTICA or to request a quote, please follow the links below.

STATISTICA Desktop


 STATISTICA for Classrooms, Departments and Campuses

Entire Classrooms and Campus-Wide

StatSoft provides licensing agreements to Universities that wish to license STATISTICA for an entire classroom or even campus-wide. To learn more about these solutions, please contact Academic Sales at 0112346148 or email us at info@statsoft.co.za

WebSTATISTICA

To request a quote, please contact us at sales@statsoft.co.za / info@statsoft.co.za, or alternatively call 0112346148 for more information.

Statistica Fraud Detection

Fraud Detection

  • Overview
  • “Fraud” vs. “Erroneous” Claims, Information, etc.
  • Fraud Detection as a Predictive Modeling Problem
  • Predicting Rare Events
  • Fraud Detection as Anomaly Detection, Intrusion Detection
  • Rule Engines and Predictive Modeling
  • Text Mining and Fraud Detection

Overview

Fraud detection is a topic applicable to many industries, including the banking and financial sectors, insurance, government agencies and law enforcement, and more. Fraud attempts have increased drastically in recent years, making fraud detection more important than ever. Despite the efforts of the affected institutions, hundreds of millions of dollars are lost to fraud every year. Because relatively few cases in a large population are fraudulent, finding them can be difficult.

In banking, fraud can involve using stolen credit cards, forging checks, misleading accounting practices, etc. In insurance, 25% of claims contain some form of fraud, accounting for approximately 10% of insurance payout dollars. Fraud can range from exaggerated losses to deliberately causing an accident for the payout. With so many different methods of fraud, finding it becomes harder still.

Data mining and statistics help to anticipate and quickly detect fraud and take immediate action to minimize costs. Through the use of sophisticated data mining tools, millions of transactions can be searched to spot patterns and detect fraudulent transactions.

An important early step in fraud detection is to identify factors that can lead to fraud. What specific phenomena typically occur before, during, or after a fraudulent incident?  What other characteristics are generally seen with fraud?  When these phenomena and characteristics are pinpointed, predicting and detecting fraud becomes a much more manageable task.

Using sophisticated data mining tools such as decision trees (boosted trees, classification trees, CHAID, and random forests), machine learning, association rules, cluster analysis, and neural networks, predictive models can be generated to estimate quantities such as the probability of fraudulent behavior or the dollar amount of fraud. These predictive models help to focus resources in the most efficient manner to prevent or recover fraud losses.
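As a purely illustrative sketch (open-source scikit-learn rather than STATISTICA, with invented file and column names such as “is_fraud”), a classifier can be trained on verified historical cases and then used to assign each new claim a probability of fraud:

    # Illustrative only: estimate a probability of fraud per claim with a boosted-tree classifier.
    import pandas as pd
    from sklearn.ensemble import GradientBoostingClassifier
    from sklearn.model_selection import train_test_split

    claims = pd.read_csv("claims.csv")           # historical claims with a verified fraud flag (assumed)
    X = claims.drop(columns=["is_fraud"])
    y = claims["is_fraud"]
    X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

    clf = GradientBoostingClassifier(random_state=0).fit(X_train, y_train)
    claims_scored = X_test.assign(p_fraud=clf.predict_proba(X_test)[:, 1])
    # Route the highest-probability claims to the investigative unit first.
    print(claims_scored.sort_values("p_fraud", ascending=False).head(10))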

“Fraud” vs. “Erroneous” Claims, Information, etc.

The notion of “fraud” implies intent on the part of some party or individual planning to commit fraud. From the perspective of the target of that attempt, however, it is usually less important whether intentional fraud occurred or erroneous information was simply introduced into the credit system, the process evaluating insurance claims, etc. From the perspective of the credit, retail, insurance, or similar business, the issue is rather whether a transaction that will be associated with loss has occurred or is about to occur, whether a claim can be subrogated or rejected, whether funds can be recovered somehow, and so on.

While the techniques briefly outlined here are often discussed under the topic of “fraud detection”, other terms are also frequently used to describe this class of data mining (or predictive modeling; see below) application, such as “opportunities for recovery”, “anomaly detection”, or similar terminology.

From the (predictive) modeling or data mining perspective, the distinction between “intentional fraud” and “opportunities for recovery” or “reducing loss” is also mostly irrelevant, other than that the specific perspective of how losses occur may guide the search for relevant predictors (and the databases where relevant information can be found). For example, intentional fraud may be associated with unusually “normal” data patterns, since intentional fraud usually aims to stay undetected – and thus hide as an average/common transaction; other opportunities for recovery of loss (other than intentional fraud), however, may simply involve the detection of duplicate claims or transactions, the identification of typical opportunities for subrogation of insurance claims, correctly predicting when consumers are accumulating too much debt, and so on.

In the following paragraphs, the term “fraud” will be used as shorthand for the types of issues briefly outlined above.

Fraud Detection as a Predictive Modeling Problem

One way to approach the issue of fraud detection is to treat it as a predictive modeling problem: correctly anticipating a (hopefully) rare event. If historical data are available in which fraud or opportunities for preventing loss have been identified and verified, then a typical predictive modeling workflow can be directed at increasing the chances of capturing those opportunities.

In practice, for example, many insurance companies support investigative units to evaluate opportunities for saving money on claims that were submitted. The goal is to identify a screening mechanism so that the expensive, detailed investigation of claims (requiring highly experienced personnel) is selectively applied to claims where the overall probability of recovery (detecting fraud, opportunities to save money, etc.; see the introductory paragraphs) is high. Thus, with an accurate predictive model for detecting likely fraud, the subsequent “manual” resources required to investigate a claim in detail are more likely to reduce loss.

Predicting Rare Events

The approach to predicting the likelihood of fraud as described above essentially comes down to a standard predictive modeling problem. The goal is to identify the best predictors and a validated model providing the greatest Lift to maximize the likelihood that the observations predicted to be fraudulent will indeed be associated with fraud (loss). That knowledge can then be used to reject applications for credit, or to initiate a more detailed investigation into an insurance claim, credit application, purchase via credit card, etc.

As most types of fraud are sporadic events (less than 30% of cases are fraud), the stratified sampling technique can be used to oversample from the fraudulent group.  This technique aids in model building.  With more cases from the group of interest, data mining models are better able to find the patterns and relationships to detect fraud.

Depending on the base rate of fraudulent events in the training data, it may be necessary to apply appropriate stratified sampling strategies to create a good data set for model building, i.e., a data file where fraudulent vs. non-fraudulent observations are represented with approximately equal probability. As described in stratified random sampling, model building is usually easiest and most successful when the data presented to the learning algorithms include exemplars of all relevant classes in approximately equal proportions.
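A minimal sketch of the oversampling idea, continuing the invented `claims` / “is_fraud” names from the earlier sketch, might look like this:

    # Resample the rare class so both classes appear in roughly equal proportions
    # in the training file (illustrative, not STATISTICA's implementation).
    import pandas as pd
    from sklearn.utils import resample

    fraud     = claims[claims["is_fraud"] == 1]
    non_fraud = claims[claims["is_fraud"] == 0]

    fraud_upsampled = resample(fraud, replace=True, n_samples=len(non_fraud), random_state=0)
    balanced = pd.concat([non_fraud, fraud_upsampled]).sample(frac=1, random_state=0)  # shuffle
    print(balanced["is_fraud"].value_counts())   # roughly 50/50 split for model building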

Fraud Detection as Anomaly Detection, Intrusion Detection

Another use case and problem definition of “fraud detection” presents itself rather as an “intrusion” or anomaly detection problem. Such cases arise when there is no good training (historical) data set that can be unambiguously assembled where known fraudulent and non-fraudulent observations are clearly identified.

For example, consider again a simple insurance use case. A claim is filed against a policy, which, given existing procedures (and rule engines; see below), triggered a further investigation that resulted in some recovery for the insurance company in a small proportion of cases. If one were to assemble a training dataset of all claims, some of which were further investigated and for which some recovery occurred or perhaps fraud was uncovered, then any modeling of such a dataset would likely capture to a large extent the rules and procedures that led to the investigation in the first place. (However, a more useful training dataset could perhaps be constructed only from those claims referred to the investigative unit for further evaluation.) In other common cases, there is no “investigative unit” in the first place, and the data available for analysis do not contain a useful indicator of fraud, potentially recoverable loss, or potential savings.

In such cases, the available information simply consists of a large and often complex data set of claims, applications, purchases, etc., with no clear outcome “indicator variable” that would be useful for predictive modeling (and supervised learning). In those cases, another approach is to perform unsupervised learning to identify in the data set (or data stream) “unusual observations” that are likely associated with fraud, unusual conditions, etc.

For example, consider the typical health insurance case. A large number of very (in fact extremely) diverse claims are filed, usually encoded via a complex and rich coding scheme to capture various health issues and common and “approved” or “accepted” therapies. Also, with each claim there can be the expectation of obvious subsequent claims (e.g., a hip replacement requires subsequent rehabilitation), and so on.

Anomaly Detection

The field of anomaly detection has many applications in industrial process monitoring, where the goal is to identify “outliers” in multivariate space that may indicate a process problem. A good example of such an application is discussed in the chapter on Multivariate Process Monitoring for batch processes, using Partial Least Squares methods. The same logic and approach can fundamentally be applied to fraud detection in other (non-industrial-process) data streams.

To return to the health care example, assume that a large number of claims are filed and entered into a database every day. The goal is to identify all claims where reduced payments (less than the claim) are due, including outright fraudulent claims. How can that be achieved?

A-priori rules.

First, obviously there are a set of complex rules that should be applied to identify inappropriately filed claims, duplicate claims and so on. Typically, complex rules engines are in place that will filter all claims to verify that they are formally correct, i.e., consistent with the applicable policies and contracts. Duplicate claims will also have to be checked.

What remains are formally legitimate claims, which nonetheless could (and probably do) include fraudulent ones. To find those, it is necessary to identify any configurations of data fields associated with the claims that would allow us to separate legitimate claims from those that are not. Of course, if no such patterns exist in the data, then nothing can be done; however, if such patterns do exist, then the task becomes finding those “unusual” claims.

The usual and unusual.

There are many ways to define what might constitute an “unusual” claim. But basically there are two ways to look at this problem: either by identifying outliers in the multivariate space, i.e., unusual combinations of data fields that are unlike typical claims, or by identifying “in-liers”, that is, claims that are “too typical” and hence suspected of having been “made up”.

How to detect the usual and unusual.

This task is one of unsupervised learning. The basic data analysis (data mining) approach is to use some form of clustering method (e.g., k-means clustering) and then use those clusters to score (assign) new claims. If a new claim cannot be assigned with high confidence to a particular cluster of points in the multivariate space made up of the numerous parameters available with each claim, then the new claim is “unusual” (an outlier of sorts) and should be considered for further evaluation. If a new claim can be assigned to a particular cluster with very high confidence, and perhaps a large number of claims from a particular source all share that characteristic (i.e., are “in-liers”), then these claims might also warrant further evaluation, since they are uncharacteristically “normal”.
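The following unsupervised sketch illustrates that idea with open-source Python tools; the DataFrames `historical_claims` and `new_claims`, and the 1%/99% cut-offs, are assumptions made purely for illustration.

    # Cluster historical claims, then flag new claims that are either far from every
    # cluster (outliers) or suspiciously close to a cluster centre (in-liers).
    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.preprocessing import StandardScaler

    scaler = StandardScaler().fit(historical_claims)     # numeric claim fields (assumed DataFrame)
    km = KMeans(n_clusters=20, random_state=0).fit(scaler.transform(historical_claims))

    def nearest_centre_distance(claims_df):
        return km.transform(scaler.transform(claims_df)).min(axis=1)

    hist_dist = nearest_centre_distance(historical_claims)
    lo, hi = np.percentile(hist_dist, [1, 99])           # "too typical" vs. "too unusual" cut-offs

    new_dist = nearest_centre_distance(new_claims)       # today's batch of claims (assumed DataFrame)
    flag_outlier = new_dist > hi                         # unlike any known cluster
    flag_inlier  = new_dist < lo                         # uncharacteristically "normal"

Claims flagged either way would be candidates for further evaluation.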

Anomaly detection, intrusion detection.

It should be noted that similar techniques are useful in all applications where the task is to identify atypical patterns in data, or patterns that are suspiciously too typical. Such use cases exist in the area of intrusion (to networks) detection, as well as many industrial multivariate process monitoring applications where complex manufacturing processes involving a large number of critical parameters must be monitored continuously to ensure overall quality and system health.

Rule Engines and Predictive Modeling

The previous paragraphs briefly mentioned rule engines as one component in fraud detection systems. In fact, they typically are the first and most critical component: Usually, the expertise and experience of domain experts can be translated into formal rules (that can be implemented in an automated scoring system) for pre-screening data for fraud or the possibility of reduced loss. Thus, in practice, the fraud detection analyses and systems based on data mining and predictive modeling techniques serve as the method for further improving the fraud detection system in place, and their effectiveness will be judged against the default rules created by experts. This also means that the final deployment method of the fraud detection system, e.g., in an automated scoring solution, needs to accommodate both sophisticated rules and possibly complex data mining models.

Text Mining and Fraud Detection

In recent years, text mining methods have increasingly been used in conjunction with all available numeric data to improve fraud detection systems (e.g., predictive models). The motivation is simply to bring together all information that can be associated with a record of interest (insurance claim, purchase, credit application) and to use that information to improve the predictive accuracy of the fraud detection system. The approaches described here apply in the same way when used in conjunction with text mining methods, except that the respective unstructured text sources must first be pre-processed and “numericized” so that they can be included in the data analysis (predictive modeling) activities.
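A hedged sketch of that “numericizing” step with open-source Python tools (the free-text column “adjuster_notes” and the numeric columns are invented for illustration):

    # Convert unstructured text into numeric features so it can join the numeric predictors.
    import pandas as pd
    from scipy.sparse import csr_matrix, hstack
    from sklearn.feature_extraction.text import TfidfVectorizer

    vectorizer = TfidfVectorizer(max_features=500, stop_words="english")
    text_features = vectorizer.fit_transform(claims["adjuster_notes"].fillna(""))   # assumed column

    numeric_features = claims[["claim_amount", "days_since_policy_start"]].to_numpy()  # assumed columns
    X_combined = hstack([text_features, csr_matrix(numeric_features)])   # feed to any classifier above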

STATISTICA Scorecard

STATISTICA Scorecard, a software solution for developing, evaluating, and monitoring scorecard models, includes the following capabilities and workflow:

Credit Scoring Webinar 

  • Data Preparation
    • Feature Selection
    • Attributes Building
  • Modeling
    • Scorecard Building
    • Survival Models
    • Reject Inference
  • Evaluation and Calibration
    • Model Evaluation
    • Cutoff Point Selection
    • Score Cases
  • Monitoring
    • Population Stability
  • Feedback
    • Comments from STATISTICA Scorecard users

Feature Selection

The Feature Selection module is used to exclude unimportant or redundant variables from the initial set of characteristics. You can create a variable ranking using two measures of the overall predictive power of variables: IV (Information Value) and Cramer’s V. Based on these measures, you can identify which characteristics have an important impact on credit risk and select them for the next stage of model development. Moreover, the Selecting representatives option enables you to identify redundancy among numerical variables without analyzing the correlation matrix of all variables. This module creates bundles of commonly correlated characteristics using factor analysis with rotation of scores. In each bundle, variables are highly correlated with the same factor (and often with each other), so you can easily select only a small number of bundle representatives.
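For readers who want the formulas behind these two ranking measures, here is an illustrative Python sketch (not STATISTICA code); it assumes the target is coded 0 = good, 1 = bad, and all column names are invented.

    import numpy as np
    import pandas as pd
    from scipy.stats import chi2_contingency

    def information_value(binned_characteristic, target):
        """IV for one binned characteristic against a 0/1 'bad' flag."""
        tab = pd.crosstab(binned_characteristic, target)
        dist_good = tab[0] / tab[0].sum()
        dist_bad  = tab[1] / tab[1].sum()
        woe = np.log(dist_good / dist_bad)              # Weight of Evidence per bin
        return float(((dist_good - dist_bad) * woe).sum())

    def cramers_v(x, y):
        tab = pd.crosstab(x, y)
        chi2 = chi2_contingency(tab)[0]
        n = tab.to_numpy().sum()
        return float(np.sqrt(chi2 / (n * (min(tab.shape) - 1))))

    # Rank all candidate characteristics by IV (df and "bad" are assumed names):
    # ranking = {col: information_value(df[col], df["bad"]) for col in candidate_columns}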


Attributes Building

In the Attributes Building module, you can prepare risk profiles for every variable. Using an automatic algorithm (based on the CHAID method) or a manual mode, you can divide variables (also known as characteristics) into classes (attributes or “bins”) containing homogeneous risk. Initial attributes can be adjusted manually to fulfill business and statistical criteria such as profile smoothness or ease of interpretation. To build proper risk profiles, statistical measures of the predictive power of each attribute (WoE – Weight of Evidence, and IV – Information Value) are generated. The quality of the binning can be assessed for each attribute using a graph of the Weight of Evidence (WoE) trend. The whole process can be saved as an XML script and used later in the Credit Scorecard Builder module.
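To make the risk-profile idea concrete, here is a hedged sketch that bins one numeric characteristic and reports the per-bin bad rate and WoE. The bin edges come from simple quantiles rather than CHAID, and the 0/1 “bad” coding and column names are assumptions.

    import numpy as np
    import pandas as pd

    def risk_profile(values, target, bins=5):
        binned = pd.qcut(values, q=bins, duplicates="drop")   # quantile bins (illustrative, not CHAID)
        tab = pd.crosstab(binned, target)
        dist_good = tab[0] / tab[0].sum()
        dist_bad  = tab[1] / tab[1].sum()
        profile = pd.DataFrame({
            "n": tab.sum(axis=1),
            "bad_rate": tab[1] / tab.sum(axis=1),
            "WoE": np.log(dist_good / dist_bad),
        })
        profile["IV_contribution"] = (dist_good - dist_bad) * profile["WoE"]
        return profile

    # print(risk_profile(df["age"], df["bad"]))   # inspect the smoothness of the WoE trend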

Credit Scorecard Builder

The Credit Scorecard Builder module is used to create a scorecard based on the attributes prepared in the Attributes Building module and a logistic regression model. The process from data to scorecard can be simplified by accepting the default parameters. Advanced users may recode initial variables into attributes (WoE or sigma-restricted dummy variables) and choose one of the model building methods:

  • Forward entry,
  • Backward elimination,
  • Forward step-wise,
  • Backward step-wise,
  • Best subset,
  • Bootstrap for all effects.


Once a model is built, a set of statistics (such as AIC, BIC, and LR tests) and reports (such as a list of the eliminated unimportant variables) can be generated. The final stage of this process is scorecard preparation: a logistic regression algorithm estimates the model parameters, and specified scale values transform the model into scorecard format, after which it can be saved as an Excel file, XML, or an SVB script.
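As an illustration of the scaling step, here is one common way (a sketch, not STATISTICA’s exact implementation) to turn a WoE-coded logistic regression into scorecard points. The scale values used here (score 600 at odds 30:1, 20 points to double the odds) are illustrative, and the convention is that the model predicts the log-odds of “bad” while a higher score should mean lower risk.

    import numpy as np

    pdo, target_score, target_odds = 20, 600, 30
    factor = pdo / np.log(2)
    offset = target_score - factor * np.log(target_odds)

    def attribute_points(woe, coefficient, intercept, n_characteristics):
        """Points awarded to one attribute (bin) of one characteristic."""
        return -(woe * coefficient + intercept / n_characteristics) * factor + offset / n_characteristics

    # Total score for an applicant = sum of attribute_points over all characteristics.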

Survival Models

The Survival Models module is used to build scoring models using the Cox proportional hazards model. You can estimate a scoring model using additional information about the time of default (when the debtor stopped paying). With this module, you can calculate the probability of default (scoring) at a given time (for example, after 6 months, 9 months, etc.).
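To illustrate the idea (using the open-source lifelines package rather than STATISTICA, with an invented file and column names), a Cox model fitted on months-on-book and a default flag can yield a probability of default by a given horizon:

    # Hedged sketch only: `loans`, "months_on_book", and "defaulted" are invented names.
    import pandas as pd
    from lifelines import CoxPHFitter

    loans = pd.read_csv("loan_history.csv")      # predictors + duration/event columns (assumed file)
    cph = CoxPHFitter()
    cph.fit(loans, duration_col="months_on_book", event_col="defaulted")

    # Probability of surviving (not defaulting) to month 6 and month 9 for each account:
    surv = cph.predict_survival_function(loans.drop(columns=["months_on_book", "defaulted"]),
                                         times=[6, 9])
    p_default_by_9_months = 1 - surv.loc[9]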

Reject Inference

In some circumstances, there is a need to take into consideration cases where credit applications were rejected. Because there is no information about the output class (good or bad credit) of rejected cases, this information is inferred using an algorithm; the k-nearest neighbors method and the parceling method are available. After the analysis, a new data set with complete information is produced.
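A sketch of the k-nearest-neighbors flavor of reject inference, with invented DataFrames `accepted` and `rejected` and a “bad” flag (illustrative only):

    # Label each rejected application with the majority good/bad class of its
    # k nearest accepted neighbours, then combine into one complete data set.
    import pandas as pd
    from sklearn.neighbors import KNeighborsClassifier

    features = [c for c in accepted.columns if c != "bad"]
    knn = KNeighborsClassifier(n_neighbors=15).fit(accepted[features], accepted["bad"])

    rejected = rejected.copy()
    rejected["bad"] = knn.predict(rejected[features])               # inferred class for rejects
    complete = pd.concat([accepted, rejected], ignore_index=True)   # full data set for modelling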


Model Evaluation

The Model Evaluation module is used to evaluate and compare different scorecard models. To assess models, you can select the following statistical measures (each with a full, detailed report; two of these measures are sketched in code after the list):

  • Information Value (IV),
  • Kolmogorov-Smirnov statistic (with respective graph),
  • Gini index,
  • Divergence,
  • Hosmer-Lemeshow statistic,
  • ROC curve analysis,
  • Lift and Gain chart.
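As an illustration only, here is how the Kolmogorov-Smirnov statistic and the Gini index can be computed from hold-out predictions with open-source Python tools (not STATISTICA’s implementation); `y_true` and `y_score` are assumed NumPy arrays of observed 0/1 outcomes and model scores.

    from scipy.stats import ks_2samp
    from sklearn.metrics import roc_auc_score

    ks_stat = ks_2samp(y_score[y_true == 1], y_score[y_true == 0]).statistic   # Kolmogorov-Smirnov
    gini = 2 * roc_auc_score(y_true, y_score) - 1                              # Gini from ROC AUC
    print(f"KS = {ks_stat:.3f}, Gini = {gini:.3f}")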


Additional reports include:

  • Final score report,
  • Characteristic report,
  • Odds chart,
  • Bad rate chart.

Then you can assess goodness-of-fit of generated models and choose one that fulfills your expectations prior to creating the scorecard model.

Cutoff Point Selection

Cutoff Point Selection is used to define the optimal scoring value for separating accepted and rejected applicants. You can extend the decision procedure by adding one or two additional cut-off points (for example, applicants with scores below 520 will be declined, applicants with scores above 580 will be accepted, and applicants with scores between these values will be asked for additional qualifying information). Cut-off points can be defined manually, based on an ROC analysis for custom misclassification costs and bad credit fraction (ROC – Receiver Operating Characteristic – provides a measure of the predictive power of a model). Additionally, you can set optimal cut-off points by simulating the profit associated with each cut-point level. The quality of the selected cut-off point can be assessed from various reports.
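The profit-simulation idea can be sketched as follows; the per-account profit and loss figures, the score range, and the `scores` / `bad_flag` arrays are all illustrative assumptions.

    # Pick the score threshold that maximises expected profit on a validation sample.
    import numpy as np

    profit_per_good, loss_per_bad = 100.0, 900.0      # illustrative business assumptions
    thresholds = np.arange(400, 701, 5)               # candidate score cut-offs

    def expected_profit(cutoff, scores, bad_flag):
        accepted = scores >= cutoff
        goods = np.sum(accepted & (bad_flag == 0))
        bads  = np.sum(accepted & (bad_flag == 1))
        return goods * profit_per_good - bads * loss_per_bad

    profits = [expected_profit(t, scores, bad_flag) for t in thresholds]   # assumed NumPy arrays
    best = thresholds[int(np.argmax(profits))]
    print(f"best single cut-off: {best}")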


Score Cases

The Score Cases module is used to score new cases using the selected model saved as an XML script. You can calculate the overall score, partial scores for each variable, and the probability of default (from the logistic regression model), adjusted by an a priori probability of default for the whole population (supplied by the user).
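The usual prior-correction formula for that adjustment can be sketched as follows (an illustration of the standard calculation, not necessarily the exact form used in the module):

    # Rescale a model's predicted probability of default when the training bad rate
    # differs from the a priori (population) bad rate supplied by the user.
    def adjust_for_prior(p_model, bad_rate_training, bad_rate_population):
        odds = (p_model / (1 - p_model)) \
               * (bad_rate_population / bad_rate_training) \
               * ((1 - bad_rate_training) / (1 - bad_rate_population))
        return odds / (1 + odds)

    # e.g. adjust_for_prior(0.40, bad_rate_training=0.50, bad_rate_population=0.05) ~= 0.034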

Population Stability

The Population Stability module provides analytical tools for comparing two data sets (for example, current and historical data sets) in order to detect any significant changes in characteristic structure or the applicant population. Significant distortion in the current data set may signal the need to re-estimate the parameters of the model. This module produces reports of population and characteristic stability with respective graphs.
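A common way to quantify such a shift is the Population Stability Index; here is a small illustrative sketch (the thresholds quoted in the comment are a widely used rule of thumb, not a STATISTICA setting).

    # PSI for one characteristic or for the score distribution, comparing a current
    # sample against a historical (expected) one, given counts per bin.
    import numpy as np

    def psi(expected_counts, actual_counts):
        e = np.asarray(expected_counts, dtype=float); e /= e.sum()
        a = np.asarray(actual_counts, dtype=float);   a /= a.sum()
        return float(np.sum((a - e) * np.log(a / e)))

    # Rule of thumb often quoted: PSI < 0.1 stable, 0.1-0.25 monitor, > 0.25 consider re-estimating.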

 

Comments from STATISTICA Scorecard Users:

Millennium Bank has used STATISTICA effectively for more than ten years to support their data analysis. They find the program to be comprehensive and user friendly. Based on their experience with STATISTICA, the Credit Risk Department decided to extend their analytic capabilities with scorecard development and maintenance solutions using STATISTICA Scorecard. It is a crucial element in effective credit risk management, which is growing in importance throughout the financial industry. STATISTICA Scorecard is a popular and reliable tool that is well known in the Polish market. We are very pleased that Millennium Bank has joined the group of satisfied STATISTICA Scorecard users.

“We consider STATISTICA Scorecard to be highly useful in everyday development of credit risk models. Millennium Bank counts StatSoft Polska among one of its most reliable business partners.”

Louis Paul
Head of Risk Department
Millennium Bank



“We wish to express our thanks and appreciation for the assistance in employing STATISTICA Scorecard to create and validate scoring models.

“For almost four years, we have been using STATISTICA data analysis software. It is needless to say how user-friendly this environment is to our analysts; nonetheless, implementation of the new STATISTICA Scorecard module helped us to discover new opportunities and dig deeper into the software’s functionalities, not only in developing scoring models, but in other functionalities as well.

“While testing the newest version of STATISTICA Scorecard, StatSoft consultants displayed solid knowledge and deep involvement. Our decision to purchase the software and participate in the ‘Credit Scoring in STATISTICA’ workshop was made with little hesitation. Our use of this software significantly reduced the workload in creating scoring charts and implementing their validation and reporting, and greatly increased the quality of such work.”

“At present, Stefczyk Credit Union is using STATISTICA Scorecard. We can confidently recommend this product to any financial institution (and not only those) that intends to reduce its financial risks and improve its sales process.”

Izabela Rutkowska
Risk Assessment Manager
Stefczyk Credit Union


 

“We would like to thank StatSoft Polska for their aid in implementing software for creating and validating scoring models (STATISTICA Scorecard).

“We must emphasize the fact that our purchase of the software was preceded by a month-long testing period. That period convinced us of the solution’s quality, which unquestionably includes its functionality and ease of use. The implementation of the Scorecard in a quality tool such as STATISTICA undeniably has proved an added asset to our company.

“The STATISTICA Scorecard software not only helps automate the procedures needed to build a scoring model, but also assists in reporting the entire process, which is of particular importance in any financial institution.

“Based on our experience, we can recommend STATISTICA Scorecard by StatSoft to any company for effective credit risk management.”

SKOK im. M. Kopernika
Risk Management Department
Kopernik Credit Union

STATISTICA Credit Scoring

STATISTICA Credit Scoring is the solution for any company that wants to build in-house models for its various credit products and decision-making. STATISTICA Credit Scoring covers all aspects of your company’s credit scoring needs, as evidenced by the testimonials from current customers.

  • In House Model Building: The STATISTICA Credit Scoring software solution enables the development & evaluation of predictive models to evaluate and assign a risk to applications for credit, either for a request for a new account or for requested changes (e.g., balance increase) to the terms of an existing credit account. 
  • Scoring Applications: STATISTICA Live Score enables companies to score credit applications, easily integrated with your existing customer service systems, self-service Websites for customers, etc.
  • Evaluate Performance: STATISTICA Credit Scoring provides built-in monitoring and evaluation of the ongoing performance of the models to enable the evaluation of outcomes and key metrics and to make decisions about when models may need to be updated.

What makes the STATISTICA Credit Scoring solution unique?

  • The Approach: STATISTICA Credit Scoring includes both traditional methods for developing credit scoring models (such as scorecards based on logistic regression) and more advanced predictive modeling methods that often provide better accuracy, which translates into decreased risk, increased approval rates, and increased profits.
  • Real-time Scoring: The STATISTICA Credit Scoring solution includes STATISTICA Live Score, the solution for enabling scoring decisions directly from customer applications via Customer Service Agents, Websites, and other line of business systems.
  • Sources of Data: Unlike generic scorecards, STATISTICA Credit Scoring can be tailored to meet your specific needs. For example, it provides the flexibility to include various data sources, such as behavior scoring, utilizing the transactional record of the account to inform recommendations for credit line increases, incentives, cross-sell or up-sell, or other changes in terms.
  • Flexibility and Capabilities: STATISTICA Credit Scoring is specific to building credit scoring models but the same approaches and techniques can also be applied to modeling customer churn, increasing the ability to detect fraud, response modeling for marketing campaigns, and other applications within your company.



STATISTICA Data Mining, Text Mining and Predictive Analytics Software

Data mining is the differentiator. Some have labelled the current period, appropriately, as “The Age of Analytics”: a period in which the information age has led us to apply analytics to derive insights from these vast sources of data.

At StatSoft, we have the opportunity to collaborate with, consult for, and train colleagues in data analysis and predictive modelling across a variety of industries: automotive manufacturing, financial services, medical device manufacturing, pharmaceutical R&D and manufacturing, semiconductors, etc. Our experience has taught us that, in a competitive economy, companies must focus on opportunities to leverage their advantages and streamline their operations. One such opportunity is to leverage the data that your company has already collected and manages.

Software

The STATISTICA Data Analysis and Data Mining Platform, including the STATISTICA Data Miner software, offers the most comprehensive and effective system of user-friendly tools for the entire data mining process – from querying databases to generating final reports. StatSoft’s data mining and predictive modelling software is available in single workstation, multiple-user (concurrent user licensing), and Enterprise editions.

STATISTICA Text Miner is an optional extension of STATISTICA Data Miner, ideal for translating unstructured text data into meaningful information.

The Enterprise edition provides an efficient server platform for off-loading resource-intensive model-building tasks, Web browser-based or Windows workstation clients, and central configuration of queries, analyses, report templates, and models.

STATISTICA Scorecard aids the development, evaluation, and monitoring of scorecard models. STATISTICA Live Score is STATISTICA Server software within the STATISTICA Data Analysis and Data Mining Platform. Data are aggregated and cleaned, and models are trained and validated, using the STATISTICA Data Miner software. Once the models are validated, they are deployed to the STATISTICA Live Score server. STATISTICA Live Score provides multi-threaded, efficient, and platform-independent scoring of data from line-of-business applications.

STATISTICA Process Optimization, an optional extension of STATISTICA Data Miner, is a powerful software solution designed to monitor processes and identify and anticipate problems related to quality control and improvement with unmatched sensitivity and effectiveness.

Services (Consulting, Training)

StatSoft’s Professional Services offer data mining consulting and training. StatSoft offers an efficient ‘Quick Start’ package of training and consulting as an optional addition to the licensing of the STATISTICA Data Miner software, assisting new software users with delivering business value and return-on-investment as quickly as possible after the acquisition of the software. StatSoft’s consultants take a collaborative approach to projects, mapping the scope of services to fit your business priorities and available resources.

Information about Data Mining Methods

Below are useful links to StatSoft’s overviews of popular data mining methods provided in the STATISTICA Data Miner platform:

Association Rules
Classification and Regression Trees
CHAID
Boosting Trees
Cluster Analysis
Support Vector Machines
MARSplines
Naïve Bayesian Classifiers
Text Mining
Partial Least Squares
Independent Components Analysis

Request Academic Statistica

 

Need to purchase statistical analysis software or data mining software for a student, faculty member, or school? We provide discounted STATISTICA single-user licenses for individual professors and students, and different discounts are available for entire classrooms or universities.

Contact us at sales@statsoft.co.za or call us on 0112346148. Let us know what types of problems you are trying to solve or what statistical analyses you are interested in.

If you are a past STATISTICA customer, you can receive an upgrade discount. You must provide a serial number to receive the discount. You can find your serial number by starting the STATISTICA application, selecting the Help menu, and then selecting About STATISTICA.

Contact Details For Academic Quote

E-mail: sales@statsoft.co.za / info@statsoft.co.za

Contact Number: 0112346148

Ethics of Making Graphs

In a few political and data-visualization blogs over the past several days, there has been a kerfuffle concerning this bar chart that the Wall Street Journal published. The gist of the chart is that the bulk of the taxable income in this country is earned by households in the $100,000-$200,000 range, and the argument made is that increasing taxes on the richest Americans won’t raise enough money to eliminate the budget deficit.
The liberal-leaning magazine Mother Jones responded with this graph, with the objection that the WSJ’s graph was drawn to imply that the rich weren’t really all that rich.
I don’t like either graph. Neither is particularly useful. But neither one is “wrong.”
Both graphs seem to have been created with the intention of making a political statement. Which is OK for political blogs because we all know that information presented by pundits has the potential to be biased. But what about graphs that are supposed to be objective – ones we see on the news or ones that we send in reports to our bosses? Is there a standard of ethics for making graphs?
Google is surprisingly silent on the issue, at least using the several searches I tried.
Here are a few items I thought of to help us improve the quality of the graphs in our lives.
Why did you categorize your data this way? Are divorced people similar enough to widowed people to represent by the same slice of a pie chart? Are 21-22 year olds more similar to 19-20 year olds or to 23-24 year olds? Did you choose the breakpoints that do the most to strengthen your hypothesis (replacing “strengthen your hypothesis” with “generate hits to your website” or “improve your chances of getting a bonus,” etc., as needed)? Can you back up your choices with research or other examples in your industry?
If you change the scaling of your graphs from the defaults, why? Are you perhaps trying to soften a bad trend or exaggerate a good trend? Will you use the same scaling in your next report even if trends change?
Ask questions when you suspect that someone is using data visualization for a deceptive purpose. Ask them to present the information in several different ways or ask for the raw data set. If you’re still reading this paragraph, I assume you have a certain level of expertise with or interest in graphs (not to mention you’re probably very sophisticated and good-looking). Use your above-average knowledge as a public service to help keep others from getting duped.
I feel obliged to make a graph from the raw IRS dataset  that was used by both the Wall Street Journal and Mother Jones. I used the same groupings as the raw data set, so part of the graph looks like the Wall Street Journal bar graph, but I also included the number of returns filed in each income category in 2008. 
We can see that the number of returns filed peaks in the $50,000 to $75,000 income range, and the total taxable income per income category peaks in the $100,000 to $200,000 range, at least according to the way that the IRS chose to break down the data for this particular table. There are many more returns filed in the lowest income groups than the highest income groups, and there is much more total taxable income in the highest income groups than the lowest income groups.
What “should” be done with tax policy after seeing this graph? I hope I’ve left you to make your own decisions on that.