Monthly Archives: February 2013

StatSoft February Webinar: Reduce Research from Weeks to Hours: Text Mining

Box-Cox Transformation

This example uses the example data set Aircraft.sta, which is distributed with STATISTICA. From the File menu, select Open Examples. Double-click the Datasets folder, and locate and open Aircraft.sta.

The goal of this example is to perform a Box-Cox transformation of one dependent variable, VISC, and save the transformed values back to the original data set.
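
For reference, the Box-Cox family of transformations is conventionally defined for positive values $y$ as

\[
y^{(\lambda)} =
\begin{cases}
\dfrac{y^{\lambda} - 1}{\lambda}, & \lambda \neq 0 \\
\ln y, & \lambda = 0
\end{cases}
\]

where the parameter $\lambda$ is typically estimated from the data by maximum likelihood. (This is the standard textbook formulation; STATISTICA may additionally support a shift parameter for data that are not strictly positive.)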

1. With the data set open, add a new variable. This is where STATISTICA will write the transformed values for use in analyses. For this example, add the variable after VISC, and name it Visc BoxCox. There are several ways to add a new variable:

a) Data – Variables – Add

b) Or from the Vars menu, select Add

c) Or double-click in the gray space around the data in a spreadsheet.

2. The spreadsheet should now contain the new, empty variable Visc BoxCox immediately after VISC.

3. From the Data menu, select Box-Cox Transformation.

4. In the Box-Cox Transformation dialog, click the Variables button, and then select the variable you want to transform. For this example, choose VISC. Click OK in the variable selection dialog.

5. The values used for the computations can be adjusted if desired; by default, the transformation parameter lambda is estimated from the data. For this example, leave the settings as they are.

6. Click OK in the Box-Cox Transformation dialog to display the Box-Cox Results dialog.

7. There are several options for displaying results, such as plots and summaries of the estimated values. But to write the transformed values back to the original data set, click the Write back to input spreadsheet button.

8. The Assign statistics to variables for saving in input data dialog is displayed. Here, specify the variable to which the transformed values will be written.

9. Under Statistics, choose the variable that was transformed, VISC. Under Variables, choose the new variable location for the transformed data, Visc BoxCox. Then click Assign.

10. The assignment will be displayed in the bottom box.

11. Click OK. You may see a warning message about an analysis being in progress; that is expected, so click OK. The original spreadsheet will be updated to include the transformed values in the Visc BoxCox variable.
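
For readers who want to sanity-check the result outside of STATISTICA, the same transformation can be reproduced with common open-source tools. Below is a minimal Python sketch using scipy.stats.boxcox; the CSV export, the file name aircraft.csv, and the column names are assumptions for illustration, and the estimated lambda may differ slightly from STATISTICA's depending on optimization settings.

    # Minimal sketch: Box-Cox transform of a positive-valued column,
    # mirroring STATISTICA's Data - Box-Cox Transformation steps above.
    import pandas as pd
    from scipy import stats

    data = pd.read_csv("aircraft.csv")   # hypothetical CSV export of Aircraft.sta
    visc = data["VISC"].dropna()         # Box-Cox requires positive values

    # With lmbda=None (the default), boxcox() estimates lambda by
    # maximum likelihood and returns it along with the transformed values.
    transformed, lam = stats.boxcox(visc)

    # Write the transformed values back, like steps 7-11 above.
    data.loc[visc.index, "Visc BoxCox"] = transformed
    print(f"Estimated lambda: {lam:.4f}")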

How to Remove Menu Commands Using the Classic Menus Interface

STATISTICA offers the flexibility of fully customizable user interfaces. Adjusting the standard user interface to better suit your specific needs is a quick and easy task with STATISTICA. This article will benefit those who use the classic menus (hierarchical style) interface.

One way of customizing the user interface is to delete menu commands. Perhaps a menu contains commands that you never use, or commands that are not available (appear dimmed) in the STATISTICA package you have installed; you may want to remove these commands to tailor the user interface to your needs. You can also remove entire menus. This example describes removing menus and/or commands from the classic menus.

First, from the Tools menu, select Customize.

The Customize dialog will be displayed.

Select the Menu tab. The Show menus for option is a drop-down list where you can select different document types in STATISTICA (Spreadsheet, Graph, Report, Workbook, etc.). Each document type has its own associated menus. For example, when a spreadsheet is active, the Data menu is available, but it is not displayed when other document types are active.

Now, with the Customize dialog displayed and Spreadsheet selected in the Show menus for list, right-click the menu or menu command that you want to delete. (All menu commands are available, that is, not dimmed, while the Customize dialog is displayed, which enables you to delete them.) A shortcut menu will be displayed.

Click Delete to remove that menu or command. In this example, we are removing Automated Neural Networks from the Statistics menu.

However, this only removes the command from the Statistics menu associated with spreadsheets. If you select a different document type, such as Graph, the Statistics menu will still list that command. As previously mentioned, each document type has its own set of menus, so you need to repeat these steps with each document type active.

If you want to restore deleted commands, select the document type on the Menu tab and click the Reset button; this undoes your changes and returns the currently selected menu to its default composition. You can also click the Reset All button to reset all of the menus to their default compositions.

How to Specify Properties for Point Markers in STATISTICA Graphs

Controls for modifying point markers for various plots are located in the General dialog, accessible by right-clicking on a graph point and selecting General Plot Options from the shortcut menu.

You can also access these controls in the Plot: General options pane of the Graph Options dialog, accessible by double-clicking on a point marker or by selecting Graph Options from the Format menu.

Click the Markers button to display the Marker Properties dialog, where you can change the marker size, color, and/or pattern. You can also specify formatted text as point markers.

The size of point markers (and fonts) can also be increased or decreased using the Increase Font or Decrease Font toolbar buttons.

Support for F4. To simplify the process of editing graph display features (e.g., font color, point markers, area patterns), STATISTICA provides support for the F4 key on your keyboard. This means that you can repeat the last command you performed in the graph. For example, if you have just changed the title font to Arial, 12 pt, Italic and you want to make the same change to the axis titles, simply highlight the axis title you want to update and press F4. Note that the F4 buffer is graph-specific. If you switch to a different graph and press F4, you will repeat the last action performed on that graph.

Fusion-io Unleashes CPUs, Reduces I/O Bottleneck to Accelerate StatSoft Data Analysis Software

Fusion-io (NYSE: FIO), developer of a next-generation shared data decentralization platform, today announced that StatSoft, provider of a comprehensive array of data analysis, data management, data visualization and data mining technologies, recently tested Fusion’s ioMemory platform with its flagship STATISTICA software, comparing Fusion ioDrives to disk-based storage components.

The StatSoft white paper concluded that Fusion ioDrives significantly increased I/O performance in the STATISTICA suite of analytics software products and solutions, greatly increasing CPU utilization and efficiency. With ioMemory, StatSoft achieved performance improvements of 300 to 500 percent, along with corresponding latency reductions, compared to legacy disk-based storage. With the increased I/O performance enabled by the Fusion ioDrives, CPU utilization increased to 90 percent in tests of large data sets, versus the 32 percent CPU utilization observed with the disk-based technology.

“Our global client base shares a common need for the fastest possible data access in order to perform analyses that drive business-critical decisions,” said George Butler, Vice President, Platform Development, StatSoft. “STATISTICA is already among the fastest data analysis software tools on the market. With the Fusion-io memory platform, STATISTICA customers can analyze information from even the largest data sets more quickly than ever before through a solution that greatly improves the efficiency of their current infrastructure.”

STATISTICA is widely used as an integral component of corporate computer infrastructures to boost productivity and the bottom line, increase safety, reduce industrial pollution, and develop environmental solutions. The STATISTICA product line and its scalable, fully web-enabled distributed processing systems are utilized in more than 60 countries in numerous languages. STATISTICA’s 1 million end users include leading global corporations in verticals such as manufacturing, power generation, semiconductors, pharmaceutical, chemical, petrochemical, food processing, automotive, heavy equipment, insurance, telecom, R&D and more.

“Few enterprises today have the luxury of wasting capital to build and run an enormous, inefficient storage system,” said Neil Carson, Fusion-io Chief Technology Officer. “Seeking an intelligently simple solution, StatSoft’s forward-thinking engineers have found a way to deliver data to their customers even faster by boosting the efficiency of their current infrastructures. The StatSoft white paper clearly demonstrates the potential for Fusion ioMemory to unlock hidden value by reducing the I/O bottleneck to put idle CPUs to work.”

STATISTICA is optimized for processing large amounts of data, so quick access to stored data is essential. Whether processing a large STATISTICA Spreadsheet in read-only mode for analysis, or creating temporary objects during data management operations, storage performance and latency directly affect application performance.

StatSoft tested the ioDrives in two categories: extensive Temp directory access and analysis of large spreadsheets. In real-world scenarios accessing Temp directories, Fusion-io delivered performance three times that of traditional disks. In analyzing large spreadsheets, Fusion’s technology produced five times the performance of disk-based storage. To review the StatSoft white paper, “STATISTICA Performance with Fusion-io ioDrive,” go to

http://www.statsoft.com/Portals/0/Support/Download/STATISTICA_Fusion_ioDrive_WhitePaper.pdf

To learn more about Fusion-io, go to www.fusionio.com. Follow Fusion-io on Twitter at www.twitter.com/fusionio or www.twitter.com/fusionioUK and on Facebook at www.facebook.com/fusionio.

About Fusion-io

Fusion-io has pioneered a next generation storage memory platform for shared data decentralization that significantly improves the processing capabilities within a datacenter by relocating process-critical, or “active,” data from centralized storage to the server where it is being processed, a methodology referred to as data decentralization. Fusion’s integrated hardware and software solutions leverage non-volatile memory to significantly increase datacenter efficiency and offer enterprise grade performance, reliability, availability and manageability. Fusion’s data decentralization platform can transform legacy architectures into next generation datacenters and allows enterprises to consolidate or significantly reduce complex and expensive high performance storage, high performance networking and memory-rich servers. Fusion’s platform enables enterprises to increase the utilization, performance, and efficiency of their datacenter resources and extract greater value from their information assets.

Forward-looking Statements

Certain statements in this release may constitute “forward-looking statements” within the meaning of Section 21E of the Securities Exchange Act of 1934 and Section 27A of the Securities Act of 1933, including, but not limited to, statements concerning the performance improvements experienced and test results reported by StatSoft and the effect of these on its operations and business and that of its STATISTICA suite customers. These statements are based on current expectations and assumptions regarding future events and business performance and involve certain risks and uncertainties that could cause actual results to differ materially from those contained, anticipated, or implied in any forward-looking statement, including, but not limited to, the risk that StatSoft may not realize the advantages it expects from deploying our technology and that other users of STATISTICA with our technology may not experience the performance advantages reported by StatSoft in its testing, and such other risks set forth in the registration statements and reports that Fusion-io files with the U.S. Securities and Exchange Commission, which are available on the Investor Relations section of our website at www.fusionio.com. You should not rely upon forward-looking statements as predictions of future events. Although we believe that the expectations reflected in the forward-looking statements are reasonable, we cannot guarantee that the future results, levels of activity, performance or events and circumstances reflected in the forward-looking statements will be achieved or will occur. Fusion-io undertakes no obligation to update publicly any forward-looking statement for any reason after the date of this press release.

Pepsi Saves Time using STATISTICA Enterprise for Quality Monitoring and Control

General Bottlers, part of Pepsi Americas, has implemented STATISTICA Enterprise to automate the process of monitoring the beverage manufacturing process.

This has saved time and money while improving process quality through the use of modern analytic tools.

Read more about how Pepsi Saves Time using STATISTICA Enterprise for Quality Monitoring and Control

Learning Statistics

I was invited to start writing a personal blog about my experiences with project management and STATISTICA. I have been an employee since 2005 and have worked on very diverse project types.

Some projects lasted two weeks; some lasted almost two years. They were for StatSoft customers and for StatSoft internal use.

In 2009, I worked on a project to migrate the StatSoft Electronic Statistics Textbook (EST) from HTML pages into a Content Management System (CMS). In many ways this was a “StatSoft internal” project. We wanted an easier method to update pages and track changes.

The project itself was easy to plan and easy to execute. We had about 200 pages to move from HTML into a CMS.

The most interesting point about this project was the actual book.

This online statistics textbook has been available since at least 1997. I found a copy of a few pages on the Wayback Machine. These 1997 pages supported the AOL and Netscape 2.0 browsers. That was the year Internet Explorer 4.0 was released, and Windows 95 was the common operating system.

EST visitors now use Internet Explorer 6, 7, 8, or 9, Firefox, Chrome, Opera, and Safari. And they visit us from their phones: Android, BlackBerry, iPhone, Nokia, etc. They visit us from other devices: iPod touch, iPad, Nook, PlayStation 3. (I don’t know why, but I am amused by the PlayStation 3 visitors. Maybe these are college students?)

During this project, I discovered that StatSoft management had offered this textbook online as a public service. This book was developed from Dr. Paul Lewicki and Dr. Thomas Hill’s teaching experiences at The University of Tulsa. Both are part of the senior management at StatSoft.

Understanding statistics is important to all of us. Statistics surrounds our daily lives. It impacts:

  • the medication we take
  • the education system our children attend
  • the food we eat
  • the cars we drive
  • the air we breathe

I was fascinated and stunned by how many people use the EST.

There are hundreds of thousands of pages that reference the EST. It is used by many universities to teach statistics.

Wikipedia editors treat the EST as an expert source, citing many of its pages as references.

Google Books can find 390 printed books that use the EST as a reference.

The EST is cited as a reference in patents.

Thousands of scholarly papers reference the EST.

When I search for EST hyperlinks in Facebook and Twitter, there are always people talking about the book.

And the EST has won various awards and recommendations. It was recommended by Encyclopedia Britannica for learning about statistics. Recently, we won a Best of Web award, with the EST receiving the highest rating (Cons = None).

What’s the Big Deal about “Big Data”?

Written by: Danny Stout

Now that I’m officially 39 years old…again…I’ve been around long enough to hear more than a few phrases become popular and then disappear after a few months or years. Most of them have been used in popular culture, but there have also been professional terms that are no longer groovy. One such popular term that I’ve been hearing a lot lately is Big Data. It seems that Big Data means different things to different people. What exactly is Big Data? What is Big Data doing for us? Finally, is Big Data here to stay, or will we be talking about a new popular term in a few more years?

Wikipedia tells us that Big Data is a “collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications”. I remember when a gigabyte seemed intimidating. Now my hard drive holds more gigabytes than Carter has liver pills. So if our hard drives hold gigabytes and even terabytes of data, what exactly defines Big Data? While gigabytes and terabytes can get large, they can still be managed on traditional hard drives and network servers. But once you get into the hundreds or even thousands of terabytes, you can no longer rely on traditional storage; more than likely you’ll have to rely on distributed file systems or other storage technologies. When you get into data of this size, stored in this manner, you are dealing with Big Data.

What can Big Data do for you? For a recent example, look at President Obama’s victory in the recent U.S. election. The Wall Street Journal published an excellent article about how Big Data, and the analysis of that data, contributed to his reelection. His campaign merged databases from many sources, including those from pollsters, fundraisers, fieldworkers, consumer databases, social media, mobile contacts and voter files, and used that data to improve their campaign efforts. They used algorithms to score this data, deriving persuadability scores for potential undecided voters. “The persuasion scores allowed the campaign to focus its outreach efforts—and their volunteer calls—on voters who might actually change their minds as the result. It also guided them in what policy messages individual voters should hear.” That’s pretty powerful stuff.

If you’ve shopped at Walmart, you’ve been on the receiving end of Big Data analysis. Walmart shoppers generate more than 1 million customer transactions every hour, and the data from these transactions are imported into databases estimated to contain more than 2.5 petabytes. That information can be used to determine where to place products, or which items to put next to one another so that you will be more likely to buy both. If you have access to Big Data, you can definitely make it work for you and hit any number of targets.

Now, is Big Data here to stay? The focus in predictive analytics is no longer on hypothesis testing but on looking to the data for revealing patterns. The data is the model. We are in what I like to call the Angler’s era, but instead of talking about how big someone’s fish is, everyone is talking about the size of their Big Data. Bigger is not necessarily better in Big Data. With sophisticated sampling methodology, you don’t need to analyze 2.5 petabytes of data just because you have access to the database; it’s a waste of time and resources. See Dr. Thomas Hill’s white paper on Big Data for an excellent discussion of this topic. I’m not sure if I’ll always capitalize the term, and I probably shouldn’t be doing it now. But I do think that big data, or the impact of big data, will be with us for the foreseeable future.

Last year, the Big Data Research and Development Initiative was introduced, with the Obama Administration announcing $200 million in new research and development investments to handle the rapidly growing volume of data. They hope to use big data to solve some of the nation’s most pressing challenges. While we will hopefully move out of the Angler’s era with regard to big data, I believe big data, and the impact of big data, will definitely be felt for a long time to come.
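
To make the sampling point concrete, here is a minimal, hypothetical Python sketch: rather than scanning an entire table, you estimate a statistic from a simple random sample and quantify the uncertainty. The simulated "population" here stands in for a database far too large to analyze in full.

    import numpy as np

    rng = np.random.default_rng(42)

    # Stand-in for a huge transaction table we don't want to scan in full.
    population = rng.lognormal(mean=3.0, sigma=1.0, size=10_000_000)

    # Draw a simple random sample of 10,000 records.
    sample = rng.choice(population, size=10_000, replace=False)

    estimate = sample.mean()
    stderr = sample.std(ddof=1) / np.sqrt(sample.size)  # standard error of the mean

    print(f"sample estimate: {estimate:.2f} +/- {1.96 * stderr:.2f} (95% CI)")
    print(f"true mean (for comparison): {population.mean():.2f}")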

In the future, Mr. Spock will use STATISTICA

Hello, I am Paul Hiller, marketing specialist at StatSoft HQ. I am a statistician neither by training nor osmosis, so I guess that makes me one of the outliers around here. Despite this, I did enjoy (and pass) a college-level Statistics 101 course many years ago. My instructor’s name was Jean Spath. I remember her name because she announced confidently on Day One, “My name is Jean SPATH, rhymes with MATH, so you’ll never forget it.” And she was right!

In case you are wondering, this is not a photo of me.

Blogwise, I have been invited to contribute a layman’s perspective, and so today I have decided to jot down some random thoughts about Star Trek, statistical glossary terms, and other musings that have formed during my time with the StatSoft team…

  • In the future, Mr. Spock will use STATISTICA. And why not? BIG data will have given way to GALACTIC data by then, and it is only logical he will want the best tool for real-time scoring when going where no man has gone before.
  • Analytics–Wasn’t that the name they gave to a U.S.S. Enterprise systems check in Star Trek? No, wait–that was diagnostics, as in, “Run a Level 3 Diagnostic, Mr. LaForge.”
  • Captain Kirk always asked Mr. Spock for “analysis”…before proceeding to defy the odds anyway.
  • The U.S.S. STATISTICA was not the name of a Federation science vessel. But it should have been!
  • It just occurred to me that U.S.S. STATISTICA sounds so natural because of that other classic TV show, Battlestar Galactica. Statistica. Galactica. Statistica. Galactica.
  • Glad to see our Facebook galactics–um, analytics have shown increased engagement lately. We even bumped up on Klout this week! (Okay, this one isn’t such a random thought, because obviously I had to pause and look up the Klout score.)
  • On the whole, predictive analytics certainly seems a smarter way to go than using a crystal ball or flipping a coin.
  • Big Data — Is that the next straw man industry to be subjected to government harassment when political expediency demands it? Like Big Oil, Big Pharma, and Big Auto?
  • On the other hand, “The Box and Whisker” would be a good name for an Irish pub.
  • RE: “2D Matrix Plots,” I enjoyed all three of them. Did the Wachowski Brothers ever develop a 3D Matrix Plot for #4? Early rumors were confirmed false.
  • Come to think of it, recalling the 0s and 1s raining down the big screen in “The Matrix”–now that was some big data!
  • When it comes to heuristics, is the statement that a consumer product “may help reduce the risk of [something]” as useless as it sounds?
  • I suspect a “Recursive Filter” is something I should avoid when brewing coffee.
  • Sun Ray Plots can be so dramatic! Check out this grouping here in St. Johns, Woking, England.

StatSoft Leaders Recognized Again with Latest Text Mining Book

The Association of American Publishers has recognized two StatSoft leaders with a second major national award for co-authorship of a new book on text/data mining.

Dr. Gary Miner, StatSoft’s Senior Statistical & Predictive Analytics Consultant, and Dr. Thomas Hill, StatSoft’s VP of Analytic Solutions, are both co-authors of Practical Text Mining and Statistical Analysis for Non-structured Text Data Applications, which received the 2012 American Publishers Award for Professional and Scholarly Excellence (PROSE) this month. Other authors of the book, which was published by Elsevier/Academic Press, included Dursun Delen, John Elder, Andrew Fast, and Robert A. Nisbet.

“We are honored to receive this important award,” notes Dr. Hill, “which I believe validates the attention that all authors devoted to create an accessible resource for practitioners who want to leverage text mining technologies for improving day-to-day business processes.”

“The PROSE Award is like the Oscar® for writing and publishing,” writes an excited Dr. Miner, “and we have won it two books in a row!”

Previously, Dr. Miner, Nisbet, and Elder had been awarded a 2009 PROSE for the Handbook of Statistical Analysis and Data Mining Applications (also published by Elsevier/Academic Press).

The PROSE Awards annually recognize the very best in professional and scholarly publishing by bringing attention to distinguished books, journals, and electronic content in over 40 categories. Judged by peer publishers, librarians, and medical professionals since 1976, the PROSE Awards are extraordinary for their breadth and depth.

The association’s Professional and Scholarly Publishing (PSP) Division presented this year’s awards at a special luncheon ceremony during the PSP Annual Conference in Washington, D.C., on Feb. 7, 2013. Miner & Hill’s text mining book won its PROSE in the Computing & Information Sciences category.