STATISTICA HP – Big Data Performance using Massively Parallel and In-Memory Processing

CPU usage showing all CPUs being utilized by STATISTICA HP

StatSoft announces STATISTICA HP, the latest release of Version 12 of the STATISTICA analytics platform, designed to leverage information contained in extremely large data sets using massively parallel and in-memory processing.

This technology allows StatSoft customers to bring supercomputing performance to their big data by leveraging the power of multiprocessor servers, which are rapidly becoming more affordable and widely available as part of the existing computer infrastructure of not only large but also midsize and even some small companies. For example, the familiar Microsoft Windows® server-based environment now supports up to 640 logical processors with Windows Server 2012 (up to 256 logical processors with Windows Server 2008 R2).

“StatSoft has the distinction of being the only analytics and predictive modeling platform specifically optimized for Windows computing platforms,” says Dr. Thomas Hill, StatSoft’s VP for Analytic Solutions. “With the latest release of STATISTICA HP, we have achieved remarkable performance on practically all computational tasks, in particular for in-memory data processing on high-performance servers.”

To illustrate the performance of the STATISTICA analytic system, StatSoft has conducted performance tests on a midrange 64-core server with 256 GB of RAM.

Statistical Computations and Summaries

As discussed in detail in the StatSoft White Paper (The Big Data Revolution And How to Extract Value from Big Data), many of the use cases around big and high-velocity data involve data summarization, aggregation, and the identification of basic relationships.

Shown below is a screenshot of STATISTICA running against a data set with 1 million records and 1,000 fields, computing 1 million correlations.

The STATISTICA software distributes the required computational load over all of the available CPUs, utilizing 100% of the hardware resources available in this system. Computing 1 million correlations (the full correlation matrix for 1,000 fields) on a data set with 1 million records completes within seconds, depending on the clock speed and memory access architecture of the system.
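One common way to spread a correlation computation over many workers is to split the records into chunks, compute per-chunk sums and cross-products independently, and then merge them. The sketch below illustrates that general idea only; it is not STATISTICA's implementation, and the names `partial_sums`, `merge`, and `correlation_matrix` are hypothetical. It uses a thread pool from the Python standard library for portability, so it shows the structure of the computation rather than a real speedup; a process pool or native code would be needed to keep every core busy.

```python
import math
from concurrent.futures import ThreadPoolExecutor


def partial_sums(rows):
    """Sufficient statistics for one chunk: row count, column sums, cross-products."""
    k = len(rows[0])
    sums = [0.0] * k
    cross = [[0.0] * k for _ in range(k)]  # only the upper triangle is filled
    for row in rows:
        for i in range(k):
            sums[i] += row[i]
            for j in range(i, k):
                cross[i][j] += row[i] * row[j]
    return len(rows), sums, cross


def merge(a, b):
    """Combine the statistics of two chunks by element-wise addition."""
    n_a, s_a, c_a = a
    n_b, s_b, c_b = b
    k = len(s_a)
    sums = [s_a[i] + s_b[i] for i in range(k)]
    cross = [[c_a[i][j] + c_b[i][j] for j in range(k)] for i in range(k)]
    return n_a + n_b, sums, cross


def correlation_matrix(rows, chunk_size=1000, workers=4):
    """Pearson correlations for all column pairs, computed chunk by chunk."""
    chunks = [rows[i:i + chunk_size] for i in range(0, len(rows), chunk_size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        parts = list(pool.map(partial_sums, chunks))
    total = parts[0]
    for part in parts[1:]:
        total = merge(total, part)
    n, sums, cross = total
    k = len(sums)

    def scaled_cov(i, j):
        # n * sum(x*y) - sum(x) * sum(y); proportional to the covariance
        lo, hi = min(i, j), max(i, j)
        return n * cross[lo][hi] - sums[i] * sums[j]

    # Assumes no column is constant (a zero variance would divide by zero).
    return [
        [scaled_cov(i, j) / math.sqrt(scaled_cov(i, i) * scaled_cov(j, j))
         for j in range(k)]
        for i in range(k)
    ]
```

Because each chunk is summarized independently, the per-chunk work can run on separate workers and only the small summary tables need to be combined. The raw-sums formula used here is simple but can lose precision on data with large means; a production implementation would likely use a more numerically stable update.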

The Power of Parallel Processing for Predictive Modeling

The architecture of STATISTICA HP provides numerous optimizations that involve massive parallelization during both the model building process and the scoring process.

For example, analytic workspaces such as the one shown below can be run on multicore servers where the competitive evaluation of multiple models is effectively performed in parallel across multiple cores, achieving 100% utilization of the computing resources of the system and yielding remarkable performance.

STATISTICA HP workspace evaluating multiple models across multiple cores

Building an effective tree-based classification model against 1 million records on the 64-core, 256 GB RAM platform described earlier completes in seconds, depending on the clock speed and memory access architecture of the system.
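The competitive evaluation of several models in parallel, as described above, can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the STATISTICA workspace engine: the two toy model builders (`majority_model`, `threshold_model`) and the helpers `evaluate` and `compete` are hypothetical, and the thread pool stands in for whatever scheduler distributes the work across cores in a real system.

```python
from concurrent.futures import ThreadPoolExecutor


def _mean(values):
    return sum(values) / len(values)


def majority_model(train):
    """Toy model: always predict the most common training label."""
    labels = [y for _, y in train]
    guess = max(set(labels), key=labels.count)
    return lambda x: guess


def threshold_model(train):
    """Toy model: predict 1 when x exceeds the midpoint of the two class means."""
    cut = (_mean([x for x, y in train if y == 0])
           + _mean([x for x, y in train if y == 1])) / 2
    return lambda x: 1 if x > cut else 0


def evaluate(builder, train, test):
    """Fit one candidate on the training data and return (name, holdout accuracy)."""
    model = builder(train)
    hits = sum(1 for x, y in test if model(x) == y)
    return builder.__name__, hits / len(test)


def compete(builders, train, test, workers=4):
    """Evaluate every candidate concurrently and return the best name plus all scores."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(evaluate, b, train, test) for b in builders]
        scores = dict(f.result() for f in futures)
    best = max(scores, key=scores.get)
    return best, scores
```

Each candidate is fitted and scored independently of the others, which is what makes this kind of model comparison easy to spread over many cores: the only shared step is the final comparison of scores.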

Also, many of StatSoft’s customers are currently using the STATISTICA Enterprise Server™ platform to enable massively parallel model-scoring in virtual on-demand environments, again highlighting the flexibility and utility associated with STATISTICA’s adherence to and compatibility with modern software standards, interfaces, and emerging computing technologies.

In addition, in STATISTICA HP 12, all advanced modeling algorithms, including the most powerful ensemble models such as random forests and gradient boosted trees, are implemented to take advantage of large numbers of cores and available RAM for efficient in-memory model building against big data.

Summary

Computing platforms with large numbers of CPUs and cores, and with the capability to handle huge data files via in-memory processing, are rapidly becoming less expensive and more common, not only in science but also in business use. Too often, however, the bottleneck is the analytic software, which limits the performance that can be achieved with such hardware. According to George Butler, StatSoft’s VP for Platform Development, “StatSoft has accumulated significant expertise over decades on how to optimize the performance of analytic software, and the STATISTICA HP platform will fully take advantage of Microsoft’s newest server platforms supporting hundreds of cores.” He continues, “We are a close Microsoft Partner but also an Intel® Software Premier Elite Partner, and our R&D is constantly looking for new and better ways to leverage existing hardware and operating system resources.”

About statsoftsa

StatSoft, Inc. was founded in 1984 and is now one of the largest providers of analytic software worldwide. StatSoft is also the largest manufacturer of enterprise-wide quality control and improvement software systems in the world, and the only company capable of supporting its QC products worldwide, with wholly owned subsidiaries in all major markets (StatSoft has 23 full-service offices on all continents). Its software is available in more than 10 languages.

Posted on May 24, 2013.
