Importing PubMed XML into STATISTICA

Original Source – Written by

27th of May, 2013

Here is a shout-out to all of the veterans of this great country, the United States of America. I appreciate the sacrifice you made on my behalf to protect my freedom. I cannot express in words my gratitude for your service. I hope we will all take a few moments to reflect on the true meaning of Memorial Day.

In my last blog post I said I would talk about importing XML into STATISTICA. In my experience, it pays to follow a methodology when approaching a data mining project. The methodology I follow is CRISP-DM (1). Today I would like to address its first two phases: Business Understanding and Data Understanding. For my case study I will consider a recommendation service for PubMed journal articles. Back in February I gave a webinar for StatSoft that presented a business case for importing XML data from PubMed (2). I do not want to spend much time today on that business case; if you are interested in learning more about it, you can view the webinar here:

http://www.youtube.com/watch?v=xUz9ALT5Fx4

Let's cut to the chase. The first step in understanding the data is being able to access it. STATISTICA does not have a standard method for importing XML, which makes this first step a little more difficult. If you try to open an XML file directly, it comes up as free-form text in a text editor window. Instead, you have to write a STATISTICA Visual Basic (SVB) macro to import the XML data. Before you start coding the SVB, you need to prepare STATISTICA to work with XML by creating a reference to Microsoft XML Core Services (MSXML). I show this step in the following video, along with a run of the starter code, which is more extensive than what was shared in the webinar.

Starter code:

Sub Main
    ' Load the whole XML file into an MSXML DOM document
    Dim xmlDoc As DOMDocument
    Set xmlDoc = New DOMDocument
    xmlDoc.async = False
    xmlDoc.validateOnParse = False
    xmlDoc.resolveExternals = False
    ' Note: update the file location in the following Load statement to match where the PubMed file was downloaded to your computer
    If Not xmlDoc.Load("C:\Users\Toby\Downloads\pubmed_result.xml") Then
        ' Load returns False when the file is missing or is not well-formed XML
        MsgBox "Could not load the XML file: " & xmlDoc.parseError.reason
        Exit Sub
    End If
    ' Select every PubmedArticle element in the document
    Dim oRoot As IXMLDOMElement
    Dim oItemNodes As IXMLDOMNodeList
    Dim iLen As Long
    Set oRoot = xmlDoc.documentElement
    Set oItemNodes = oRoot.selectNodes("//PubmedArticle")
    iLen = oItemNodes.length
    ' Create a spreadsheet with one case per article; columns 3 and 4 are left unused by this starter code
    Dim s As Spreadsheet
    Set s = Spreadsheets.New("PubMed")
    s.SetSize(iLen, 4)
    s.VariableName(1) = "PMID"
    s.VariableName(2) = "Title"
    s.VariableSetTextType 2, 500
    ' Copy the PMID and title of each article into the next spreadsheet row
    Dim oNode As IXMLDOMNode
    Dim sPMIDNode As IXMLDOMNode
    Dim sTitleNode As IXMLDOMNode
    Dim sPMID As String
    Dim sTitle As String
    Dim cell As Long
    For Each oNode In oItemNodes
        Set sPMIDNode = oNode.selectSingleNode("./MedlineCitation/PMID")
        If Not sPMIDNode Is Nothing Then sPMID = sPMIDNode.Text Else sPMID = ""
        Set sTitleNode = oNode.selectSingleNode("./MedlineCitation/Article/ArticleTitle")
        If Not sTitleNode Is Nothing Then sTitle = sTitleNode.Text Else sTitle = ""
        cell = cell + 1
        s.SetData(cell, 1, sPMID)
        s.SetData(cell, 2, sTitle)
    Next
    RouteOutput(s).Visible = True
    ' Release the XML objects
    Set sPMIDNode = Nothing
    Set sTitleNode = Nothing
    Set oNode = Nothing
    Set oItemNodes = Nothing
    Set xmlDoc = Nothing
End Sub

How to video:

Keep in mind that a DOM document is read into memory in its entirety, which can limit the size of the files you can work with on your computer. I will include some specs for my computer and some benchmarks for different XML file sizes.

Computer Specs:

1.  Windows 8, 64 bit

2.  Intel i7-3770 @ 3.4 GHz

3.  12 GB RAM
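On the memory limit mentioned above: when a file is too large to hold as a DOM, a streaming parser is one alternative. This is not what the starter code does, and it is shown here outside STATISTICA as a minimal Python sketch using the standard library's xml.etree.ElementTree. The two-article sample and the function name stream_rows are made up for illustration.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical, heavily trimmed sample in the shape of a PubMed XML export
SAMPLE = b"""<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation>
      <PMID>11111111</PMID>
      <Article><ArticleTitle>First sample title</ArticleTitle></Article>
    </MedlineCitation>
  </PubmedArticle>
  <PubmedArticle>
    <MedlineCitation>
      <PMID>22222222</PMID>
      <Article><ArticleTitle>Second sample title</ArticleTitle></Article>
    </MedlineCitation>
  </PubmedArticle>
</PubmedArticleSet>"""


def stream_rows(source):
    """Yield one (PMID, title) tuple per article as the parser reaches it."""
    # "end" events fire when an element's closing tag has been read,
    # so the article's children are complete at that point.
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "PubmedArticle":
            pmid = elem.findtext("./MedlineCitation/PMID", default="")
            title = elem.findtext(
                "./MedlineCitation/Article/ArticleTitle", default=""
            )
            yield pmid, title
            # Discard the article's subtree once it has been read; the
            # emptied element itself stays attached to the root.
            elem.clear()


# A file path or an open file can be passed instead of the in-memory sample
rows = list(stream_rows(io.BytesIO(SAMPLE)))
for row in rows:
    print(row)
```

The trade-off is that only one article is available at a time, so document-wide queries such as "//PubmedArticle" are no longer possible.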

Benchmarks:

 

XPath

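XPath is the query language used to select the specific nodes you want to gather information from.

XPath examples from the starter code:

XPath to get the PubMed IDs: "./MedlineCitation/PMID"

XPath to get the document title: "./MedlineCitation/Article/ArticleTitle"

Both paths are relative: the leading "./" means they are evaluated from each PubmedArticle node that "//PubmedArticle" returns, not from the top of the document.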
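To see what these two paths select, they can be tried on a small document outside STATISTICA. The following is a minimal Python sketch that uses the standard library's xml.etree.ElementTree in place of MSXML. The sample XML and the function name extract_rows are made up for illustration, and the sample keeps only the elements the paths touch. ElementTree supports only a subset of XPath, but simple child paths like these fall within it.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily trimmed sample in the shape of a PubMed XML export.
# The second article has no ArticleTitle, to show the "missing node" case.
SAMPLE = """<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation>
      <PMID>11111111</PMID>
      <Article><ArticleTitle>First sample title</ArticleTitle></Article>
    </MedlineCitation>
  </PubmedArticle>
  <PubmedArticle>
    <MedlineCitation>
      <PMID>22222222</PMID>
      <Article></Article>
    </MedlineCitation>
  </PubmedArticle>
</PubmedArticleSet>"""


def extract_rows(xml_text):
    """Return one (PMID, title) tuple per PubmedArticle element."""
    root = ET.fromstring(xml_text)
    rows = []
    # ".//PubmedArticle" plays the role of "//PubmedArticle" in the starter code
    for article in root.findall(".//PubmedArticle"):
        # Same relative paths as the starter code; a missing node gives ""
        pmid = article.findtext("./MedlineCitation/PMID", default="")
        title = article.findtext(
            "./MedlineCitation/Article/ArticleTitle", default=""
        )
        rows.append((pmid, title))
    return rows


rows = extract_rows(SAMPLE)
for row in rows:
    print(row)
```

The empty-string default mirrors the "Is Nothing" checks in the starter code: an article without a title still produces a row, with a blank in the title column.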

Here is a good XPath reference:

http://msdn.microsoft.com/en-us/library/ms256086.aspx

Now that you have the starter code and some basic understanding of XPath, I would challenge you to do the following:

1.  Perform a search on PubMed

2.  Download the results in XML format

3.  Import the XML into STATISTICA using the provided starter code

4.  Determine some other item within the XML that is of interest to you

5.  Modify the XPath in the starter code to import the item of interest
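For steps 4 and 5, one candidate item is the abstract. The sketch below is in Python (standard library only) instead of SVB, so that the XPath can be checked before it goes into the macro. The sample XML and the name abstracts_by_pmid are hypothetical. It assumes the abstract sits under MedlineCitation/Article/Abstract/AbstractText, and that an article may carry several AbstractText elements, as structured abstracts do.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: one article with a two-part abstract, one with none
SAMPLE = """<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation>
      <PMID>11111111</PMID>
      <Article>
        <ArticleTitle>First sample title</ArticleTitle>
        <Abstract>
          <AbstractText Label="BACKGROUND">Background text.</AbstractText>
          <AbstractText Label="RESULTS">Results text.</AbstractText>
        </Abstract>
      </Article>
    </MedlineCitation>
  </PubmedArticle>
  <PubmedArticle>
    <MedlineCitation>
      <PMID>22222222</PMID>
      <Article><ArticleTitle>Second sample title</ArticleTitle></Article>
    </MedlineCitation>
  </PubmedArticle>
</PubmedArticleSet>"""

ABSTRACT_XPATH = "./MedlineCitation/Article/Abstract/AbstractText"


def abstracts_by_pmid(xml_text):
    """Map each PMID to its abstract text, joining multi-part abstracts."""
    root = ET.fromstring(xml_text)
    result = {}
    for article in root.findall(".//PubmedArticle"):
        pmid = article.findtext("./MedlineCitation/PMID", default="")
        # findall returns every matching node, not only the first one;
        # itertext also collects text nested inside inline markup.
        parts = [
            "".join(node.itertext()).strip()
            for node in article.findall(ABSTRACT_XPATH)
        ]
        result[pmid] = " ".join(part for part in parts if part)
    return result


result = abstracts_by_pmid(SAMPLE)
for pmid, abstract in result.items():
    print(pmid, "|", abstract)
```

In the SVB starter code, selectSingleNode returns only the first matching node, so an item that can repeat within one article would need selectNodes and an inner loop instead.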

Feel free to post your comments or questions. I plan on posting again in two weeks, when I will demonstrate how to use STATISTICA to convert PubMed abstract text into a usable format for creating a PubMed recommender service. Good luck in the coming weeks as you import some XML into STATISTICA!

References:

(1) http://en.wikipedia.org/wiki/Cross_Industry_Standard_Process_for_Data_Mining

(2) http://www.ncbi.nlm.nih.gov/pubmed/

Posted on June 10, 2013.