Monday, January 22, 2007

OpenDocument as the perfect Microsoft Office file format

How to:

Add native file support for OpenDocument to Microsoft Office

What is the da Vinci plugin for Microsoft Word? Where did it come from? How does it work? And can it really convert files generated by Microsoft Word versions 97-2007 into fully compliant, OpenDocument format-ready applications? Can it do so so without disruption to bound business processes, dependent line of business applications, and assistive technology add-ons?

Let's begin from a simple truth, discussed in more detail below. The OpenDocument format ("ODF") is able to handle anything Microsoft Office can throw at it, and handle it at least as well as Microsoft's new EOOXML file formats. That includes the bound business processes, assistive technology add-ons, line of business dependencies and advanced feature sets unique to various versions of Microsoft Office. The converse however can not be said of Ecma Office Open XML -- the new Microsoft EOOXML file formats -- as evidenced by Microsoft's own ODF Translator Plug-in Project difficulties.

In April of 2006, the Commonwealth of Massachusetts Information Technology Division broke ranks with IT tradition and put out an open Request for Information concerning the feasibility of an ODF Plugin for Microsoft Office. It is not that Massachusetts ITD had rejected the many current software offerings of ODF vendors and open source community efforts. But they lacked a critical tool to perfect a migration to ODF -- from years of Microsoft Office business processes and binary document dependencies to ODF-ready solutions -- without impossibly costly disruptions to their day-to-day business activities.

Massachusetts ITD correctly reasoned that ODF plugins for the Microsoft Office major apps would solve this problem, enabling a mixed environment of interoperable desktop solutions, giving their business processes much needed time to be migrated for the right reasons at a manageable pace. The right reasons are productivity gains and efficiency enhancements, as opposed to merely reactive efforts to squelch the costly insanity of file format madness. The manageable pace is variable, depending on human and other resource availability, rather than on a disruptive rip-and-replace of most software across entire networks to achieve interoperability.

Imagine what your own office would be like if you came to work one day and discovered that all of the software staff normally used had been replaced on every machine by different software. Could your office afford to shut down long enough for everyone to learn all of the new software and procedures? Is that kind of trauma something you would like to go through? People need to be able to exchange documents. That means software interoperability is key, and it simply is not workable for people to exchange files in formats that are incompatible with software used by the recipients.

Now imagine how much worse the problem would be if your office is dependent on fully-automated business processes where processing is entirely dependent and bound to specific applications, and those applications spanning the process pipeline have a secret language only they can speak. Additional workstations must have the right software if they are to participate in the process. Business process change becomes an issue for the application vendors since they alone can tweak the language specific applications. That's expensive. So much so that many a business lets much-needed process changes lapse until they simply are no longer competitive or profitable, and the cost of rewriting the applications to re engineer the business process must be paid; at any cost. Still, a driving and determining factor is that any new adjustments must be compatible with the information that drove prior iterations of that business process. Anything less than full fidelity in converting legacy documents to new formats greatly limits the changes and productivity enhancements that can be gained as well as ensuring that your applications and service continue to come from the same vendors -- for years and years to come. -- No matter what the cost.

Full fidelity is a term used by the file conversion congnoscenti. It refers to the quality of conversions from one file format to another. Full fidelity means the conversion from one format to another cannot result in data loss. If data loss is possible, for example depending on what software features are invoked in a given document, then we refer to the conversion as lossy.

For example, there is the always problematically lossy and always one way-only conversion of a Microsoft Word 97 binary file format (.doc) to a Microsoft Word 2003 binary file format (.doc). There is also the common file format conversion of a Microsoft Word binary .doc to Rich Text Format ("RTF") and back. For our purposes, we are concerned with the fidelity of file conversions between OpenDocument and binary files bound to Microsoft Word versions from Word 97 to Word 2007. Microsoft is able to perfect a full fidelity conversion of those same binary documents to and from EOOXML, using the Microsoft XML compatibility pack plugin. We argue that the same full fidelity of conversion is possible between Microsoft Word and OpenDocument, using the daVinci plugin for Microsoft Word. In fact, the process of how this is done is very similar to that used by the MS XML compatibility pack.

Automated business processes are key to increased productivity and profit, but they are still relatively few and far between, a situation not expected to last long. Automated business process systems are regarded as one of the next Big Things in the software industry. Legacy business process systems often require much end user interaction such as manual management of information, verification of data, tracking of data as it transits the process, etc. Many are dependent on particular legacy software applications' proprietary binary file formats.

Basically, an automated system provides exacting interfaces so that the system handles the data, workflow and intelligent routing of documents, and users concentrate on the decision points and architecture of the process itself. The Workflow Management Coalition defines workflow, in general, as "the automation of a business process, in whole or parts, where documents, information or tasks are passed from one participant to another to be processed, according to a set of procedural rules," where participant can mean either a human being or another automated process.

To get from the partial automation of legacy systems to the full automation of future systems, information transfer must be accelerated systematically and human participants must be provided with interfaces that enable rapid reorganizing of people, resources, documents, eMail, messaging, wiki collaborative documents, direct access to data bindings within those documents, and the ability to push change into advanced workflows managed by "the system". There is simply no practical way to do this without having a universal file format based on the XML portable document with system-bound data model. The universal XML file formats exist.

The systems-level XML hubs are coming on strong as evidenced by the Exchange/SharePoint/Groove Hub, Workplace/Lotus Notes/Domino, Scalix, and Zimbra. This is where data and various information streams merge with the portable document model, with information streams collapsing into a single easy to schedule and manage aggregate; the merge occurs in the context of end user project and workflow management interfaces.

Full fidelity conversions of the information that drives our current business processes to universal portable XML document models is the only way we can make that productivity leap to fully automated business processes. Every time there is a loss of fidelity in data conversion we basically lose a node in the process. We have to go manual and fix a problem that should have been handled by the system. That impedes any potential productivity gains.

The major problem with anything less than full fidelity conversion is that human intervention to fix the artifacts is required. This is both disruptive and costly. Imagine a situation where an important binary document is worked in MSOffice and NOT converted to ODF at the headpoint by daVinci. That binary document then flows through the system and is converted to ODF and further processed at an OpenOffice workstation. There is some loss of presentation, but no big deal for the OpenOffice wiz. From there it goes to a Workplace enabled workstation where it is processed and responded to. No problemo, WorkPlace can read and write ODF perfectly. Including an ODF-XForms document. From there it goes back to an MSOffice workstation for review and data extraction. The individual sends it back to the WorkPlace desktop and asks for a Word 2000 .doc format. If it is an ODF-XForms document, there is no way to convert it back to Word 2000 binary .doc. If it is a compound document of any complexity, there will be loss in the conversion. No matter what happens, this process simply isn't going to work. In fact, no one in their right mind would ever set up or allow for such a poorly mixed environment in the first place.

Enter daVinci. The conversion of binary documents occurs at the headpoint of the process, Microsoft Word, and it is full fidelity. No need to waste expensive human attention fixing conversion artifacts. Once the process is in ODF, these documents can be passed through a mixed environment without loss of that original fidelity.

Let's take for example a more extensive and document centric business process. One that handles a trillion dollars per year in transaction-after-transaction volumes; the real estate transaction industry. For each and every transaction many professional participants must come together, exchange hundreds of documents, work under a strict and unforgiving schedule, and do this with a perfection and data exactness that, if it fails, could entirely wipe out the life time savings and asset accumulations of their clients. Today the entire real estate transaction is driven by paper documents, just as it was 50 years ago. The only difference is that today much of data and document content is provided by back end data processing systems. Almost none of which are able to talk to each other.

The documents involved must be continually reviewed and re written to make certain the data and information is current. The cost and availability of a mortgage changes by the hour. A bank might pre-approve the buyer and the property to be purchased. Even lock in the rates. But there is always the possibility that secondary mortgage market evaluations, purchases, and constantly changing loan ratios can take a lender out of the market on the very day they are scheduled to deliver a loan product. It's all an incredibly time-sensitive process.

The average real estate transaction brings together many professionals: buyer and seller representative real estate agents, their brokers and transaction coordinators, the buyers and sellers, their attorneys, a lender, a title company, inspection companies, insurance companies, appraisers, contractors filing contingency removal estimates, and key information systems providers such as the Multiple Listing Service, county records, probate courts, and courts of law recording such things as liens and encumbrances. All in all, every real estate transaction has at least twelve professionals, each having to deliver specific documents and reports, and each having a back end data processing system different and unable to speak with any other back end system. It is a mess. It is impossible to automate this process without a universal digital file format that can be produced in and translated back into the many proprietary back end systems involved. And do so with absolutely perfect fidelity of document conversion and data extraction.

So here we have a paper driven process that costs consumers upwards of 12% of the actual transaction. And it's a trillion dollar a year industry once all the before and after sale costs are factored in.

A common situation in the real estate industry is that of the professionals having to exchange documents electronically. Faxing is great for paper, but there is a problem of version control as the volume of documents approaches the final closing. Most of the industry relies on eMail attachments as the primary exchange method. What a nightmare. To exchange documents, participants first have to synchronize the versions of Microsoft Office used to produce these usually compound and data centric transaction instruments. They call it version madness in that there is no way to exchange Microsoft Office documents without first having everyone involved using exactly the same version. Meaning, the quality of fidelity conversion between different versions of MSOffice is a killer. Data loss occurs unpredictably. It is unacceptable.

How do you force everyone who must participate in a real estate transaction to all use the exact same versions of software? Well, the same way governments routinely force tax payers to use the same versions of Microsoft Office. You either get the right versions or you don't get to participate.

There are other problems with eMail attachments of proprietary binary documents. Let's say you are exchanging a sales report spreadsheet with inventory, pricing, availability, special discounts or price cuts, sales, customer name and ID, etc. Every time that spreadsheet changes hands, the data has to be verified and updated. The more rapid and wide the exchange, the more intense the data management - data extraction problem becomes. It is a mess. A dangerous and risky mess.

Those spreadsheets could have important formulas and data bindings. Using the XForms or Jabber P2P data binding models, an ODF spreadsheet document can manage itself. Every time the sales people connect to the Internet, the inventory, discounts, sales totals, etc fields can be updated by "the system" with exacting precision. The end user no longer has to worry about data verification and integrity. The system takes care of it, no matter how fast and furious the changes occur. It's automated within the document. Bullet proof. The same can be done with a Microsoft EOOXML document.

However, try to convert one of these documents, and risk dropping formulas, data bindings, rules based routing, or workflow scripting, and the whole process will come to a screeching halt. As it should. This is but another way some proprietary vendors lock in their customers. If you can't do a full fidelity conversion, and be confident of the results, the only alternative is to turn to a single vendor and year after year pay the price.

Enter the OpenDocument Foundation. Since early 2003, a loose gaggle of ODF developers had been seeking methods to improve the ("OOo") conversion of Microsoft Office binary formats to what later evolved into ODF. By December of 2003, the OOo community actually launched an official project to develop an ODF plugin for Microsoft Office. The difficulties were enormous, and at the time there was no perceived “business case” for the effort that might have justified allocation of sufficient resources from the Sun Microsystems StarOffice group, the primary developers for

However, as Microsoft began its own long march to XML formats, openly discussing important MS binary format to XML conversion issues, these “difficulties” began to be understood. In 2005 there were a series of breakthroughs based partly on released information from Microsoft. The breakthroughs also sprang from our own intuition of what was really behind some of the oddities in the Microsoft approach. Here's a quick list of reference materials Microsoft provided that really helped:

When Massachusetts put out their request for information, as it happened we were only an installation program away from our first ODF 1.0 compatible plugin for Microsoft Word, the most heavily-used application in Microsoft Office. In June Massachusetts sent us our first set of 150 primary importance test documents. We sent them their first plugin install, which was dialed for perfect interoperability with the ODF 1.0-compliant reference implementation, OOo 2.0. In August, we sent Massachusetts the first da Vinci version of the plugin, which was dialed for perfect, 100% round trip conversion fidelity with the legacy of Microsoft Word binaries.

A lot happened in Massachusetts between June and August of 2006. The short form is that as Massachusetts came to understand how our plugin worked, we came to understand the true nature of the challenges they faced. We saw up close and personal the challenges of the Microsoft Office bound business processes and the need for perfect fidelity “round-tripping” of documents between different formats within automated business processes. The Microsoft assistive technologies were never an issue for us because the plugin works natively within Word. Microsoft Word handles the assistive technologies, and all the plugin does is the conversions from Word's in-memory-binary-representation to ODF <> and back.

The situation with the MSOffice bound business processes was such that everyone agreed we had to have a much higher fidelity converting those binaries than the extraordinary 85 per cent fidelity credited to the file conversion engine. We had to have perfect fidelity because there was no reasonable expectation of ever successfully migrating those business processes to a Microsoft Office alternative like ODF-ready, StarOffice, WorkPlace or Novell Office. Such a re-engineering of the business processes would be costly and beyond disruptive.

If however we could achieve full fidelity conversions of legacy Microsoft binary documents to ODF, and were able to guarantee the roundtrip process of these newly christened ODF documents in a mixed desktop environment, one comprised of ODF-enabled Microsoft Office,, Novell Office, WorkPlace, and KOffice, the existing MSOffice bound business processes could continue being used even as new ODF ready workstations were added to the workflow. Massachusetts could migrate to non-Microsoft Office software in manageable phases, restoring competition -- and sanity -- to the Commonwealth's software procurement program.

If on the other hand, there is no full fidelity conversion to ODF of legacy documents available at the head point of the migration -- Microsoft Office -- then the business process will break under the weight of users having to stop everything to fix and repair the artifacts of lossy file conversions. What Massachusetts discovered is that users will immediately revert to a Microsoft-only process wherever the business process system breaks down due to conversion fidelity problems. It is a productivity killer and a show stopper for migration to ODF-supporting software.

Most people believe that full fidelity file conversion is impossible, largely influenced by the poor fidelity generally experienced in converting documents from one vendor's binary file formats to another vendor's binary file formats. The major reason some people transfer such beliefs to the daVinci plugin is that they do not realize the plugin works natively inside Microsoft Word. The Word program itself performs the conversions using its internal processes for native support of a file format. The da Vinci plugin triggers the native Microsoft Word conversion process, intercepts the output at precisely the right moment, and maps it to ODF. When opening an ODF document, da Vinci simply reverses the process (although there is nothing simple about what da Vinci does or how da Vinci does this).

Think of it this way. The Microsoft Compatibility Pack is a EOOXML plugin for older versions of Microsoft Office. If these versions of Microsoft Office can use a plugin to perfect a conversion process between legacy binary formats and EOOXML, the same process can be used for ODF. In fact, it's actually easier to perfect a similar conversion process to and from ODF instead of EOOXML, particularly for Microsoft. Easier in that the ODF specification is lean and clean by comparison and very carefully structured. Even easier if you have the blueprints for those binary file formats.

My educated guess is that it would take all of two weeks for Microsoft to write a Microsoft Word ODF plugin. Indeed, Microsoft developers reportedly told Massachusetts that it would be "trivial" for them to implement ODF in Microsoft Office. If they can do it for something as complex and convoluted as EOOXML, ODF will be a snap. The trick is not in XML. It's in having the secret legacy binary format blueprints.

There's a reason we call our plugin “da Vinci”. Yeah, you guessed it. There is in fact a secret code needed to unravel the remaining mysteries of the legacy binaries; the elusive 15% just beyond the reach of the conversion engine.

The da Vinci plugin is conformant with version 1.2 of the ODF specification still working it's way through the OASIS ODF sub-committees and is dependent on features of the draft specification that may conceivably change. Massachusetts was well aware of this fact, and knew we were treading on dangerous ground because ODF 1.2 was not yet final. They gave us the go-ahead anyway. Such is the importance of round trip high fidelity conversions with the legacy binaries.

Why ODF 1.2? Microsoft is notorious for frequently changing its file formats. There is even a Microsoft document jarred loose in the antitrust litigation where they brag about maintaining a "moving target" for other vendors' efforts to achieve interoperability with Microsoft software. For example, there is no single "DOC" file format. There are versions after versions of just the word processing format, all different. And their specifications are still secret. Imagine the volumes of unspecified binary objects, application version-specific or add-on specific processing instructions, and aging system dependencies buried in those billions of legacy binary files. There is no telling when or where they will pop up and da Vinci will have to somehow handle what are otherwise one unmappable black hole after another.

The easy solution is to simply wrap (enclose) these dark objects in foreign element tags. While this method is perfectly legal ODF because of ODF's strong interoperability features, see e.g., ODF v. 1.1 section 1.5, it comes at the cost of interoperability with other ODF-ready applications. Sure, inside da Vinci-enabled Microsoft Word the conversions and handling are perfect. Absolutely perfect. Send one of these dark object loaded documents to OpenOffice though and the application has no idea what to do with it.

ODF 1.2 provides da Vinci with two important advantages. The first is that ODF 1.2 ready applications “must” preserve tags whether they understand them or not. This is important to round-tripping. ODF 1.2 applications become routers of documents, instead of document end points.

The second aspect of ODF 1.2 is that of metadata XML/RDF based flexibility. Instead of wrapping these dark objects in a non descriptive tag, ODF 1.2 will enable da Vinci to wrap with a loosely descriptive generic element tag, but then nail into the more flexible attribute model everything da Vinci can describe about the blob. This gives ODF 1.l2 ready application like OpenOffice or KOffice a better than even shot at reading and rendering dark objects that would otherwise have been dropped. Over time, the elusive 15% of lossage during their conversions of Microsoft binary formats will disappear.

This is a far better way of handling dark objects than what EOOXML tries to do. Under that specification, the implementing application simply wraps the object and passes it along. There is no effort to describe these objects in XML-speak except for the tag name. The specifics needed to render these objects are left out. Which means, only those applications with access to the secret binary blueprint can properly render these dark spots.

As a proof of concept, da Vinci provides a most valuable service. First of all it proves the ODF can handle anything EOOXML can handle. ODF can handle everything and anything Microsoft Word can throw at it. Second, da Vinci proves that Microsoft developers could at least as easily have written an ODF plugin for Microsoft Word as they did an EOOXML plugin. For Microsoft, knowing the secret binary blueprint as they do, this simply can not be seen as a technical challenge. It was purely a business decision. Third, da Vinci demonstrates that it is possible to perfect a migration to ODF at minimal cost of disruption to existing business processes and application bound routines. The migration to ODF is possible. The starting point of this migration is exactly where Massachusetts ITD thought it would be – with ODF plugins forMicrosoft Office.

We have not yet developed plugins for the other major applications in Microsoft Office; Excel and Powerpoint. However, we have investigated the involved issues enough to have confidence that those tasks are also achievable. The proof of concept and working prototype of Excel is completed. Only a proof of concept for PowerPoint exists.

How to:

Interact with Microsoft business processes using OpenDocument

One of the interoperability challenges the OpenDocument format faces is that of the developers of existing non-Microsoft applications trying to implement a file format that follows after the fact of the design and development of the specific layout engines. The tradition of productivity applications is that the application was written with the file format an afterthought based on the in-memory binary representation needs of the application. Every application had a perfect binary dump of these in-memory-representations, and from there conversions to other file formats began. Early and open cooperation between existing application providers in the specification development process is essential for later interoperability.

One hope for ODF is that a new generation of ODF-ready applications will appear. And these applications will be geared from the git-go to produce perfect, highly interoperable, and round trip ready ODF.

To that end the da Vinci developers began work on the InfoSet Engine and API. This is designed to be a lightweight conversion (and someday layout) engine useful to mobile devices, server side applications, rapid development desktop productivity applications, and the easy ODF enabling of IDE's. The principal is the same as da Vinci except that the InfoSet Engine must do the MS binary conversions without the native assistance of the Microsoft Office applications. That is a step for which da Vinci relies on Microsoft Word to do natively.

So it's safe to say that the more we watch da Vinci do its work, the more we learn about how Microsoft Office perfects its own conversions. This knowledge is then passed to the InfoSet conversion algorithms. Going native gives one a whole new perspective on an enormous problem that spans decades.

The InfoSet API is intended to provide developers with easy-to-call interfaces that do all the conversions for their applications and services. We have not yet begun work on the InfoSet layout engine, but so far this is one of the more promising ODF projects we've worked on, maybe more important and certainly more lasting than da Vinci. We believe that da Vinci will have its day, cracking open Microsoft Office. But that day will only last until the Microsoft Office bound business processes and their related binaries are migrated to ODF-ready information processing chains. After that, (perhaps a three-year process) the day belongs to InfoSet and apps fully supporting ODF such as

InfoSet is important because the world is moving towards Internet-enabled information processing chains that mesh desktops, servers and device-based information systems. We need ODF-ready chains, and the only way to get there is to make it as easy as possible for application developers to provide interop-perfect ODF.

To understand this urgency, one need only to look at the Microsoft Vista Stack, an information processing chain where every application speaks perfect EOOXML. The head point of the Vista chain is of course Microsoft Office. But the core of this chain is the Exchange/SharePoint/Groove Hub. The Hub is where information is aggregated, sorted, managed, scheduled, re purposed and bound into intelligent collaborative workflows. The driving force is that of the portable XML document, with data being bound to interactive document objects, extracted and worked as the business process flow demands.

These chains are extremely productive. So much so that i believe every Microsoft Office business process will migrate as soon as possible to either the Vista – EOOXML chain, or, to an ODF ready alternative.

Some interesting thoughts here. The da Vinci plugin and those on the development road map for the other major Microsoft Office apps will be able to convert the Microsoft Office head point to an ODF pump. This opens the way for ODF ready Hubs. Zimbra comes to mind. An InfoSet-enabled version of Zimbra (just an example :). Once the Microsoft Office bound business process moves to the enhanced productivity of an ODF ready Hub, the desktop workgroup space opens up for any ODF enabled alternative to Microsoft Office. That means, for example, any Linux desktop running or KOffice can fully participate in these Hub-hosted workflows.

Many will no doubt disagree. But I personally believe the above information processing chain scenario is the secret to Linux desktops running wild and finally replacing the 485 million Microsoft Office bound desktops that dominate business.

Related information

The da Vinci plugin and InfoSet Engine – API have been presented at technology conferences and to interested governments serious about migration to ODF. This includes:

  • EU IDABC Experts Group Contacts: Barbara Held, Peter Strickx, Elmar Geese

  • Commonwealth of Massachusetts ITD Contacts: Louis Gutierrez, Timothy Vaverchak, Claudia Bowman

  • State of California ITD Contacts: Bill Welty, CIO of the Air Resources Board

The Massachusetts Plugin Feature Priority List:

  • Microsoft Word perfect conversion fidelity plugin (ODF 1.2)
  • Plugin Accessiblity Enhanced Feature Set (ODF 1.1 compatible)
  • MSExcel perfect conversion fidelity plugin (ODF 1.2 with formulas)
  • MSPowerPoint plugin
  • Roundtrip Interoperability with ODF 1.2 version of
  • XForms Interface for da Vinci enabled Microsoft Word
  • XForms Interface for MSExcel and MSPowerPoint
  • PDF - ODF Digital Signature Interface for Microsoft Office

OASIS Open Office XML Application Vendors:

The interoperability of any universal file format very much depends on the open cooperation of existing application vendors. New applications written to the universal file format specification will of course have near perfect interoperability, but there are serious application layout and feature impedance difficulties for existing applications trying to implement a universal file format, no matter how open or consensus driven the specification development process happens to be. ODF was most fortunate to have a number desktop and enterprise application developers participati ng in the process. Although Microsoft was part of the original group of founding members, and continues to maintain that membership in the OASIS ODF XML TC, they have never actively participated. Their "observer" status did however give Microsoft direct access to all discussions, meetings, conference minutes, eMail exchanges, listserves, documents and proposals. Clearly if Microsoft had participated in the ODF process, the stubbornly defiant interoperability problems we see today between ODF and Microsoft's own proprietary XML could have been resolved.


Post a Comment