Notes2Self.net

Stephen McGibbon's Web Journal

Stephen's Cache of Google's Cache of IBM's OpenXML Document

This is G o o g l e's cache of http://www.ibm.com/developerworks/db2/library/techarticle/dm-0705gruber/ as retrieved on 5 Jun 2007 09:00:43 GMT.
G o o g l e's cache is the snapshot that we took of the page as we crawled the web.
The page may have changed since that time. Click here for the current page without highlighting.
This cached page may reference images which are no longer available. Click here for the cached text only.
To link to or bookmark this page, use the following url: http://www.google.com/search?q=cache:FBZj4heo5xEJ:www.ibm.com/developerworks/db2/library/techarticle/dm-0705gruber/+%22chris+c+gruber%22&hl=en&ct=clnk&cd=1


Google is neither affiliated with the authors of this page nor responsible for its content.
These search terms have been highlighted:  chris  gruber 

IBM®
Skip to main content
    Country/region [select]      Terms of use
 
 
    
     Home      Products      Services & industry solutions      Support & downloads      My IBM     
skip to main content

developerWorks  >  Information Management | XML | Open source  >

Manage Microsoft Office 2007 documents with DB2 9 pureXML

Storing and re-purposing data with PHP's PDO and XQuery using IBM DB2 9

developerWorks
Document options

Discuss

Sample code


Rate this page

Help us improve this content


Level: Intermediate

Chris C. Gruber (mailto:gruber@ca.ibm.com?subject=Manage Microsoft Office 2007 documents with DB2 9 pureXML), IBM Technical Marketing, Developer Initiatives, IBM

24 May 2007

Integrate your Microsoft® Office 2007 documents into your enterprise and Internet applications more easily than ever before with DB2® 9. Review older methods of data interchange with MS Office documents, and learn how MS Office 2007 offers better data interchange. This article discusses interchange with DB2 9 XQuery, Zend Core for IBM®, HP: Hypertext Preprocessor (PHP), and Php Data Objects (PDO) technologies.

Introduction

What's new with desktop documents? A great deal, if you follow the Windows® community. Microsoft has introduced a new XML Paper Specification (XPS) format for MS Office 2007 products that leverages XML internally. You could exchange data between MS Office applications before; but MS Office 2007 and Windows Vista allow enterprises to tap into islands of data that were historically very difficult to consume and re-purpose (that is, to utilize data intended for one purpose for another purpose). For example, consider a company whose marketing department is planning a conference. They ask prospective speakers to submit proposals in a Microsoft Word template document, and the document contents are then used for the following tasks:

  • Reviewing submissions
  • E-mailing the potential speakers of successful submissions
  • Posting related information on a conference Web site
  • Publishing content through a new feed (ATOM/RSS) for the conference or company

Traditionally, marketing staff would cut text from the Word document and paste the information into the various systems, which was time consuming. Retaining the document's original format and extracting relevant information for specific tasks saves substantial time and effort. Using XQuery, you can do this with a surprisingly small amount of code. In this article, we'll briefly review older document data interchange methods for MS Office applications. Then, we'll look at the new format and discuss how this format can be interchanged and re-purposed. We'll use cross platform technologies like Zend Core for IBM, PHP, PDO and XQuery -- all tools you can use with DB2.

This technique can be a valuable part of content management and document management solutions. After reading this article, I hope you will find that consuming and re-purposing MS Office 2007 documents is easy with IBM DB2 pureXML™ features. Not much code is required, so this solution is easy to implement. Later, you may want to index these XPS documents to further take advantage of the pureXML hybrid storage features.

Resources required

While you can download Apache 2.0, PHP Version 5.21, DB2 Express-C, and the DB2 extensions for PHP, downloading and installing Zend Core for IBM is a better option. The installer completely configures your environment for PHP and DB2. The installer should include links for downloading DB2 Express-C 9.1.2. You may need to tweak the configuration to add zip support, which we'll discuss later.

Old document data interchange

Data Application Programming Interfaces (APIs) on the Windows operating system started with Data Access Objects (DAO). They progressed from DAO (not Dead On Arrival) to Remote Data Objects (RDO), and then Open Database Connectivity (ODBC). The transition to ODBC was great and marked milestone in access. It enabled a common means for developers to code to one API regardless of the database. Using ODBC, you can import relational data into MS Office applications through the likes of Mail Merge.

The introduction of the Object Linking and Embedding Data Base (OLE/DB), which enables data interchange regardless of data source, only confused the market. For MS Office productivity tools, this meant that you could embed documents within each other. In fact, you see this today when pasting a spreadsheet into a presentation. However, the OLE/DB is a Windows-only solution. Dissecting documents was not very easy with these technologies, and automating the process was difficult.

An overview of XPS

Windows Vista and MS Office 2007 introduces a new document format. Microsoft touts that this format is accessible and easily shared. MS Office documents have been proprietary for many years, and it is refreshing that the Office team within Microsoft is opening the doors. The XPS format is effectively a ZIP file containing resources and a folder for MS Office documents. The actual text content of something like a Word document is stored as an XML file. In our sample document, it's found in "word/document.xml". The XPS file format allows developers to capture the true data within these documents. See the ZIP file contents below:


Figure 1. Contents of submit.docx within WinZip
WinZip file illustrating that there are XML documents with a Word 2007 document

Setting up the database

Of course, you'll need to create a table. You'll store the XML of the XPS format within the XML column rather than in a Character Large Object (CLOB). To do this, we'll need to create an XML-enabled or UTF-8 database.


Listing 1. Creating an XML-enabled database
                
CREATE DATABASE XPS AUTOMATIC STORAGE YES USING CODESET UTF-8 TERRITORY US
      

Now, you'll want to connect to the database and create a table with the following definition. The code below assumes you have a username of "db2admin" with CREATETAB authorities. If you do not have this user setup, you can adjust the scripts for an appropriate user on your database.


Listing 2. Creating an XML-enabled table to contain XPS
                
CREATE TABLE DB2ADMIN.DOCUMENT (
         ID INT NOT NULL PRIMARY KEY,
         OWNER VARCHAR(128),
         DOC XML
       )

Setting up PHP and Zend Core for IBM

The code will use PHP's ZIP utilities and PDO. PDO is already set up for you with Zend Core for IBM, but you need to add "zip" to the configuration:

  1. Open the Zend Core Administration Console.
  2. Switch to the Configuration tab and Extensions sub-tab.
  3. Scroll down to the Zip extension.
  4. Click on the switch icon to turn it on.
  5. Click on Save Settings.
  6. Restart Apache2.
    1. If it's not already running as a red feather in the system tray, start C:\Program Files\Zend\Apache2\bin\ApacheMonitor.exe.
    2. Click on it, and restart the Apache2 HTTP Server from the menu.

If this does not work for you, try editing the file and make sure the following lines are found in php.ini within the "C:\Program Files\Zend\Core for IBM\etc" folder. Add the line in bold below to your php.ini:


Listing 3. Modifications for php.ini
                
extension=php_zip.dll
extension=php_pdo.dll
extension=php_pdo_ibm.dll
extension_dir="c:\program files\Zend\Core for IBM\lib\phpext"
            

Consuming XPS

Now, you're ready to consume the XPS file. Included in the download, I have a submit.docx file which uses MS Office 2007 XPS format (that we can agree upon for download purposes). Listing 5 represents what this document looks like when viewed in Word. The document utilizes Word styles like "Heading 1" and "Heading 2" for formatting. Later, we'll query against the Heading 2 style.


Figure 2. XPS Word document shown in Word 2007
Word 2007 document as displayed in Word 2007

Let's consume this document with the following PHP code, which effectively loops through the entries in the manifest of the zip and pulls the XML document with relevant data. Listing 5 provides the code. Save this code under the filename "xpsconsume.php" for reference purposes.


Listing 5. PHP code to consume XPS format (xpsconsume.php)
                
<?php
  $user = "db2admin";
  $password = "secret";	
  $zip = zip_open("c:\submit.docx");
  $db = new PDO("ibm:DRIVER={IBM DB2 ODBC DRIVER};DATABASE=XPS;" .
    "HOSTNAME=localhost;PORT=50000;PROTOCOL=TCPIP;", $user, $password);
  echo $user." Connected\n";


  if ($zip) {

     while ($zip_entry = zip_read($zip)) {

        if (zip_entry_open($zip, $zip_entry, "r") & 
                 zip_entry_name($zip_entry) == 'word/document.xml' ) {

            $buf = zip_entry_read($zip_entry, zip_entry_filesize($zip_entry));
	    $insstr ="INSERT INTO DB2ADMIN.DOCUMENT VALUES (1,'BILLY ONAIRE',:buf)";
	    $stmt = $db->prepare( $insstr );
	    $stmt->bindParam( ':buf' , $buf , PDO::PARAM_LOB , strlen($buf) );
  	    $stmt->execute();
	    echo "Result: ".$db->errorCode();
            zip_entry_close($zip_entry);
        }
        echo "\n";
    }
    zip_close($zip);

  }

?>
      

Run this code with the following command line:

 php xpsconsume.php

This inserts an XML document in the database and prints out an error code. Hopefully, the code will be zero.

Re-purposing XPS

Now that you have the XML portion of the XPS document in the database, you need to re-purpose this document for your Web site. This is what the Marketing department would have to do in our example. The code is in Listing 7, which you can save as "xpsrepurpose.php" for references purposes. From the output, you can easily see how to reformat this into a news feed or incorporated into an Ajax application!


Listing 6. Re-purposing the submit.docx into a simple HTML paragraph
                
<?php
  $user = "db2admin";
  $password = "secret";	

  $db = new PDO("ibm:DRIVER={IBM DB2 ODBC DRIVER};DATABASE=XPS;" .
    "HOSTNAME=localhost;PORT=50000;PROTOCOL=TCPIP;", $user, $password);
  echo $user." Connected\n";

  $xqry = 
<<<TXT
values( 
  XMLSERIALIZE( XMLQUERY(' 
    declare namespace w0="http://schemas.openxmlformats.org/wordprocessingml/2006/main";	
    for \$t0 in db2-fn:xmlcolumn("DB2ADMIN.DOCUMENT.DOC")/w0:document/w0:body/w0:p
       let \$style := \$t0/w0:pPr/w0:pStyle
       let \$txt := \$t0/w0:r/w0:t/text() 
       where fn:exists(\$txt) or \$style/@w0:val =''Heading2''
       return  
         if ( fn:exists( \$style ) ) then 
            if (\$style/@w0:val =''Heading2'' )  then <h1>{\$txt}</h1>
            else  () 
         else <p>{\$txt}</p>') 
  as VARCHAR(2000)))
TXT;

  $result=$db->query( $xqry );
  $arr = $result->fetch();
  echo $arr[1];

?>
      

In order to run the code, use:

php.exe xpsrepurpose.php 

Now you may ask, why wouldn't you just save this as HTML? The XML generation could jeopardize the integrity of the original document and lose MS Office features. With that said, the document does get much smaller and you can still re-purpose it with XQuery, as with the XPS file format. As your documents get bigger and there are more of them (such as in the case of big conferences), you'll want to apply this logic for each section of each document in your repertoire; then the power of XQuery really shines through! Like SQL, XQuery lets you handle sets of documents with one query. In fact, you can bring the results into SQL queries again if you so wish because DB2 can interchange the languages.

Tip: I was assisted in building this XQuery statement using DB2 Developer Workbench XQuery support. I managed to build the query through point and click. This saved a fair bit of time.

The following listing illustrates the result of the xpsrepurpose.php:


Listing 7: The output
                

db2admin Connected
<h1>Introduction to Web 2.0</h1><p>This is a document
 that will impress upon ...</p><h1>Abstract</h1><p>The nature of the industry is a
gain turning to the browser ...</p>

Share this...

digg Digg this story
del.icio.us Post to del.icio.us
Slashdot Slashdot it!

Conclusion

The techniques discussed in this article can be useful in content management and document management solutions. Hopefully, you've found that consuming and re-purposing MS Office 2007 documents is easy with IBM DB2 pureXML features. There really is not much code involved, which is conducive to great performance. Consider indexing these XPS documents later to further take advantage of the pureXML hybrid storage features.

To learn more on XQuery, work with the XQuery editor in DB2 Developer Workbench or look at DB2 9: pureXML Overview and Fast Start (IBM Redbooks). See the IBM_PDO site on pecl.php.net for more information on PDO.




Back to top


Download

Description Name Size Download method
Sample PHP scripts and submit.docx file mngingoffice2007.zip 11KB HTTP
Information about download methods


Resources

Learn

Get products and technologies
  • DB2 Express-C can be freely distributed with your applications.

  • Zend Core for IBM bundles Apache, PHP, and all the necessary modules for DB2 with a convenient administration console.

  • Build your next development project with IBM trial software, available for download directly from developerWorks.

Discuss


About the author

Chris' photo

With more than thirteen years of industry experience, Chris Gruber, Technical Manager for Developer Initiatives, works with IBM Information Management on industry-leading Data Server products. Previously, Chris has worked as a Senior Product Manager for Sybase iAnywhere Solutions. He has worked very closely with engineering and development partners accelerating partners time to the market.




Rate this page


Please take a moment to complete this form to help us better serve you.



Yes No Don't know
 


 


1 2 3 4 5
Not
useful
Extremely
useful
 


Back to top



    About IBM Privacy Contact

Comments

Notes2Self.net said:

A couple of weeks ago there was an interesting post on IBM's developerWorks site entitled "Manage Microsoft

# June 13, 2007 7:43 PM

Brian Jones: Open XML Formats said:

Here are a few interesting links I came across this week: Open XML in Science and Nature - Murray Sargent

# June 13, 2007 11:05 PM

Doug Mahugh said:

Professor Flavio Soares da Silva from the University of Sao Paulo has an interesting article on "Arguments

# June 14, 2007 3:40 AM