Monday, January 22, 2007

Running On OpenDocument Inside of Microsoft Office

Perfect Conversion Fidelity & The daVinci ODF Plugin for Microsoft Office

By now it's clear that Microsoft's Ecma-approved Office Open XML file format specification is filled with contradictions to existing ISO/IEC standards. Beyond the traditions of international standards consideration, there is a second, perhaps even more important concern: How do we reasonably migrate from a world where Microsoft Office bound business processes drive critically important economic, governmental and organizational concerns? How do we migrate these processes to OpenDocument ("ODF"), which is an international standard designed for interoperability? And is there any possibility of converting with acceptable fidelity the billions of binary documents trapped in Microsoft's proprietary file formats? The world wants to move to ODF XML. But the question is, "Can this be done?" And further, "Can this be done without costly disruption to our day to day business processes?" Microsoft has long claimed that only their proprietary Office Open XML could convert those billions of binaries to XML without loss of fidelity (data loss, or "lossiness"). They claimed that ODF was inadequate and unable to handle the rich feature set of Microsoft Office. This is a strange claim in that the "X" in XML stands for eXtensible. Since XML formats are eXtensible, of course ODF can handle anything Microsoft Office or those billions of binary documents have to throw at it. The real truth about this issue is that no one has ever been able to crack the secret code of those Microsoft binaries that hold so much of the world's documents. And Microsoft is not about to disclose the specifications for those formats. Years of reverse engineering efforts by non-Microsoft developers have brought us within range, but the binaries are an ever-moving target for interoperability. With each new version of Microsoft Office applications, the binary formats change arbitrarily. For example, there is no single DOC format; there are a host of DOC formats that all use the same file extension, DOC. 
Do not kid yourself. Microsoft is not about to be the one that converts all those binary documents to the open standard ODF. Instead, they set out to convert those billions of binaries to their own proprietary XML. So, the next question is of course, "Is there any possibility of converting those binaries to Microsoft's EOOXML and from there a transformation to ODF?" After all, easy and perfect transformation is the promise of XML. Microsoft's Steve Ballmer answers that question for us when he claims that conversions between EOOXML and ODF can be done, but that Microsoft's plugin will never provide full fidelity conversions between EOOXML and ODF; in other words, only a core set of features will be converted and the conversions of documents implementing other features will be lossy, resulting in data loss in largely unpredictable situations, depending on the differences in particular documents. So Microsoft's solution will not allow the automated conversion of those billions of binary documents to ODF. Not without data loss. We'll have transformation processes between EOOXML and ODF, it just won't be worth doing unless you are willing to manually compare documents when rendered both on Microsoft's applications and on an application that fully supports ODF, just to ensure that no crucial data was lost in the conversion. Essentially Microsoft is claiming that only they can convert the billions of binary documents to XML with the fidelity needed, perfect fidelity (no lossiness), and only to Microsoft's flavor of XML, EOOXML. And Microsoft wants the ISO to award it a monopoly in converting its legacy formats to XML by making its own personal XML its own personal international standard. Microsoft is also funding an open source project to perfect conversions (more properly, "transformations" in XML lingo) between EOOXML and ODF. 
Maybe just to prove they are good guys who can be trusted with our information, and that they are not out to "replace" ODF, but rather to perfect the conversion of those legacy billions of binaries to XML. One objective of the Microsoft-Novell Translator Project is to provide an easy to install EOOXML <> ODF plugin for both Microsoft Office and OpenOffice. Betas for both of these plugins are expected to be released this month. Because this work is somewhat in the open, we are well aware of the intransigence of continuing conversion fidelity problems in both versions. (The Novell work on the OpenOffice.org Writer Translator plugin is complete, but not open to the public, or contributed back to OpenOffice.org -- yet). Note that the Translator Project is based on an XSL Transformation process. So it's expected to be both application and platform independent. The only question is, "Can they achieve the quality of fidelity needed to be of any use?" Steve Ballmer says no. And he's funding the project. There's also the consideration that a scorpion can't help but continue to act like a scorpion. This famous quote from Bill Gates continues to haunt the technology industry to this day. He might as well have been referencing the Microsoft binary file formats:
"I doubt they [Digital Research] will be able to clone Windows. It is very difficult to do technically, we have made it a moving target and we have some visual copyright and patent protection. I believe people underestimate the impact DR-DOS has had on us in terms of pricing." (May 18, 1989 - Bill Gates)

The ODF Plugins Appear::

When Massachusetts announced their Request for Information concerning the possibility of an ODF Plugin for Microsoft Office, there were many responses, each with a different approach to the problem. Microsoft responded to the RFI with the promise of their XSLT based Translator Plugin. Sun provided two different plugin designs; the first based on an OpenOffice server side conversion, the second based on a C# routine connecting Microsoft Office functions to a locally installed OpenOffice conversion. In both cases it was the OpenOffice conversion engine we know and love that was doing the work of MS binary formats to ODF and back. Importantly, neither the Microsoft nor the Sun plugins allow ODF to be set as the default file save format in Microsoft Office, what is known as "native support" for a file format, a situation sure to produce lots of accidental non-ODF files. Imagine what it would be like to have to train yourself never to hit the "Save file" option or keystroke shortcut in Microsoft Office. Instead, you must open a special menu option to save a file as ODF. Do you think you just might accidentally use the normal file save commands every now and then? An unpredictable mixture of file formats in a network can have unpredictable consequences, particularly when automated processes are involved. A third ODF Plugin for Microsoft Office was proposed and submitted by the OpenDocument Foundation. This plugin conversion process was based on internally triggering a Microsoft Office native conversion process; one the Foundation believes is the same as or similar to that which is used when the EOOXML Compatibility Pack is installed, and the Microsoft Office in-memory-binary representation of the document ("IMBR") is converted to EOOXML. Unlike the other plugins, the Foundation plugin adds full native support for the ODF file formats to Microsoft Office. (The present version adds that support to Microsoft Word; later versions will add it to Excel and PowerPoint.) 
You can use the normal file save dialogs and commands. Moreover, you will not need to study/rewrite all of your existing scripts to ensure that files are saved to the right format. Just set the default file save format to ODF. There were other submissions, but these are the three we know the most about. They each represent a different approach to the problem of converting those billions of binary documents to XML. They each offer a different quality of conversion fidelity. Only one allows ODF to be set as the default file save format in Microsoft Office. The real question is whether or not any of the three provide file conversion fidelity acceptable enough so that there would be little or no disruption to existing Microsoft Office bound business processes, line of business dependencies and the functioning of assistive technology add-ons. In short, without near-perfect conversion fidelity, there is no measure of "interoperability" worth talking about. That is the reality of Microsoft's EOOXML blackmail attempt. For years they have withheld from competitors and customers the secret binary file format details needed for perfect conversions, reserving that advantage for themselves. Only Microsoft holds the key to unlock your information from the Microsoft applications it remains bound to. So because they alone hold the key to so much of our binary-bound information, they insist that the world must adopt their proprietary, self serving, application and platform bound, monopoly leveraging XML as an international standard. We are being blackmailed by the problems of converting those legacy billions of binary documents to XML. I have a counter offer that ISO/IEC might consider: give us the keys to those legacy binaries and the documentation for the new MSXML InfoSet binaries that first appeared in Microsoft Office Excel 2007, and we'll give you international standardization for EOOXML. 
A fair trade I think, because it will break the monopolist's grip, level the competitive playing field, and restore competition wherever desktop, server and device systems need to interconnect and exchange information.

Three Conversion Approaches to Consider::

What we have then are three conversion methods, all enabled through an easy to install plugin model, and each with a different level of conversion fidelity:
  • The MS Translator XSLT method (EOOXML <> ODF) :: note that this initially was an application and platform-independent approach. In the strict sense, this is a "transformation", not a "conversion".
  • The Sun external OpenOffice.org conversion engine method (MS Binary Files <> ODF) :: note that the OOo conversion engine is based on years of reverse engineering needed to understand the secret structure of those mysterious and enigmatic billions of binary documents. A secret only Microsoft can unlock with perfect fidelity.
  • The Foundation's daVinci "internal" conversion process (MS in-memory-binary representation <> ODF) :: note that this process harnesses the internal conversion methods of Microsoft Office applications in much the same way as the EOOXML Compatibility Pack.

A brief description of the daVinci internal process:

So how does daVinci do this magic of triggering an internal process and letting the native resources of Microsoft Office perfect the conversion? Well, the first problem was getting inside Microsoft Office applications and working "natively". The know-how for this was provided by Microsoft themselves when they mounted their first MSXML add-on to Microsoft Office 2003. And what a great job they did! Once inside Microsoft Office and working natively, the entire view of how to best convert MS binaries to ODF changes radically. Rather than trying to crack the intransigent and enigmatic binaries externally, on the inside you simply let the Microsoft Office applications do it for you. We don't know for sure, but there is every indication that daVinci works very similarly to how the EOOXML Compatibility Pack works. There is no doubt that without the public information Microsoft has provided concerning the early versions of MSXML, we would not have had the series of breakthrough discoveries that make daVinci possible. So the key to daVinci is in letting Microsoft Office apps handle the billions of binary documents, especially their conversion to IMBR (the Microsoft Office apps in-memory-binary representation). Internally, when a conversion process of any sort is triggered, Microsoft Office apps follow pretty much the same routine. There is a point where this internal conversion process can be intercepted and routed (mapped) to a different, non-Microsoft, file format structure. Imagine if an internal conversion process from IMBR <> EOOXML is triggered, and you intercept the process the moment before mapping to EOOXML begins. And you reroute by mapping to ODF. That's daVinci. DaVinci triggers, intercepts, and maps to ODF. It could just as easily be configured to map to Chinese UOF. Or to EOOXML (well, forget the easy part regarding EOOXML; the haphazard and sprawling structure of EOOXML makes this mapping difficult - but as the MS Compatibility Pack proves, it is possible :). 
daVinci could even be configured to map to Romanian XML or Oracle XML. The conversion quality of the daVinci process really depends on the flexibility of the XML schemas it is mapping to. Let me say that again, "The conversion quality of the daVinci process really depends on the flexibility of the XML schemas it is mapping to". Since EOOXML was made expressly and specifically for mapping Microsoft Office IMBR to, you better get perfect fidelity. What about ODF? Yes, you can get the same perfect fidelity. The flexibility is there, and has been there since the February 2003 addition of the <foreign element> tags, section 1.5 of the ODF v1.0 standard (casually referred to as the <microsoft tags> because of what they can do). So yes, if you can break the secret of the proprietary IMBR, understand its hidden structure and function, you absolutely can get perfect-fidelity conversions to ODF and EOOXML. This is an incredible achievement for the OASIS ODF Technical Committee ("TC"). ODF was designed to be a universal file format, totally application and platform-independent, and it has the built in flexibility to easily handle anything the enigmatic billions of binaries might throw at it. Tapping into the Microsoft Office IMBR just makes it easier for daVinci to see what's actually happening inside those unspecified binary blocks that blanket the billions of binary documents we're trying to convert to XML. As Rob Weir has remarked: "If Microsoft supported ODF 1.0 in Office today, using the foreign attribute support already specified in ODF 1.0, they could achieve backwards compatibility with their legacy documents. There is nothing that prevents them from adding a "DoItLikeWord95" attribute to an ODF document."
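Rob Weir's remark can be illustrated with a minimal sketch. It builds a single ODF text paragraph and attaches an attribute from a namespace outside the ODF schema, then shows what an application that does not know the extension might do with it. The only real identifier here is the ODF text namespace; the `urn:example:legacy-compat` namespace, the function names, and the use of `DoItLikeWord95` as an actual attribute (borrowed from the quote) are assumptions made purely for illustration, not anything daVinci or Microsoft Office is known to emit.

```python
import xml.etree.ElementTree as ET

# Real ODF 1.0 text namespace.
TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"
# Hypothetical vendor namespace, NOT part of ODF; stands in for a "foreign" namespace.
COMPAT_NS = "urn:example:legacy-compat"

ET.register_namespace("text", TEXT_NS)
ET.register_namespace("compat", COMPAT_NS)


def paragraph_with_foreign_attribute(content: str) -> ET.Element:
    """Build a text:p element carrying one attribute from a non-ODF namespace."""
    p = ET.Element(f"{{{TEXT_NS}}}p")
    p.text = content
    # Attribute name taken from the Rob Weir quote; purely illustrative.
    p.set(f"{{{COMPAT_NS}}}DoItLikeWord95", "true")
    return p


def strip_foreign_attributes(element: ET.Element) -> ET.Element:
    """Mimic a consumer that keeps only attributes in the ODF text namespace."""
    clean = ET.Element(element.tag)
    clean.text = element.text
    for name, value in element.attrib.items():
        if name.startswith(f"{{{TEXT_NS}}}"):
            clean.set(name, value)
    return clean


if __name__ == "__main__":
    para = paragraph_with_foreign_attribute("Legacy layout preserved")
    print(ET.tostring(para, encoding="unicode"))
    print(ET.tostring(strip_foreign_attributes(para), encoding="unicode"))
```

The paragraph text survives in both outputs; only the application-specific hint is lost when a consumer ignores the foreign namespace.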

Blanketed with Unspecified Binary Objects - The dark spots ::

The real problems of converting those billions of binary documents or working as a near native file format within Microsoft Office have nothing to do with either EOOXML or ODF. And everything to do with the secret, enigmatic binary file formats. Microsoft is busily spinning the world to convince us otherwise, but it only takes one demonstration of daVinci to set things straight. We shouldn't give in to blackmail. Especially blackmail designed to leverage the Microsoft desktop monopoly deep into our future of converged and highly interoperable multi platform systems. Control of the file formats, and keeping them bound to proprietary applications and platforms, is control of everyone's information and information processes. So Microsoft Office will do a great conversion of those billions of binaries to IMBR for us. And when triggered, IMBR will set things up for daVinci to intercept an internal conversion structure and map to ODF. Because there are billions of binary documents out there, with years of file level application feature tweaking and enhancements by independent LOB - business process developers and assistive technology add-ons to deal with, there's no telling what kind of unspecified binary objects daVinci will encounter and have to map to ODF. daVinci needs the mapping flexibility in the XML target structure to place these unspecified anomalies, otherwise called "dark objects". The thing is that these binary object anomalies are unspecified on both ends of the conversion equation. They are unspecified with regard to the historical annals of reverse engineering, which itself is based on the cryptic, enigmatic, and often misleading documentation Microsoft has provided for RTF and the MS binaries. And, they are unspecified by the XML structures at the other end of the conversion equation. Like ODF.
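To make the idea of "placing" a dark object concrete, here is a minimal sketch of wrapping an opaque binary blob in an XML element from a foreign namespace and recovering it byte for byte. Everything named here is hypothetical: the `urn:example:dark-objects` namespace, the `object` element, the `hint` attribute, and both function names. It shows only the general wrapping technique, not how daVinci or any Microsoft tool actually stores such objects.

```python
import base64
import xml.etree.ElementTree as ET

# Hypothetical namespace for wrapped, unspecified binary objects (not part of ODF).
DARK_NS = "urn:example:dark-objects"
ET.register_namespace("dark", DARK_NS)


def wrap_dark_object(blob: bytes, hint: str) -> ET.Element:
    """Wrap an unspecified binary object as base64 text inside a foreign element."""
    element = ET.Element(f"{{{DARK_NS}}}object", {"hint": hint})
    element.text = base64.b64encode(blob).decode("ascii")
    return element


def unwrap_dark_object(element: ET.Element) -> bytes:
    """Recover the original bytes; only a consumer that knows the namespace can do this."""
    if element.tag != f"{{{DARK_NS}}}object":
        raise ValueError("not a wrapped dark object")
    return base64.b64decode(element.text or "")


if __name__ == "__main__":
    original = bytes(range(16))
    wrapped = wrap_dark_object(original, "unidentified legacy block")
    serialized = ET.tostring(wrapped, encoding="unicode")
    restored = unwrap_dark_object(ET.fromstring(serialized))
    print(restored == original)
```

The round trip is lossless for the application that wrote the wrapper, while any other reader sees only an element it has no rule for.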

The Skinny on daVinci inside::

When a user loads a binary document (or creates a binary document in Microsoft Office applications), the apps themselves convert the binary documents to IMBR (the Microsoft Office apps in-memory-binary representation). The user works the document in IMBR mode. This means all application features, business process adaptations, assistive technology add-ons, whatever, are available and cooking without disruption or change. When an internal MS conversion process of IMBR is triggered, daVinci intercepts the results, and maps to ODF. The ODF version is saved to file. An internal conversion process is triggered whenever functions like save, save as, open, or open most recent are called.
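The trigger, intercept, and map sequence can be sketched as a small pipeline with a pluggable target format. This is a toy model under stated assumptions: the real IMBR is not public, so `InMemoryDocument` is a made-up stand-in holding only paragraphs, and `SaveInterceptor`, `map_to_odf`, `map_to_plain_text`, and the `MAPPERS` registry are all hypothetical names. The ODF mapper emits a bare fragment without namespace declarations, not a valid ODF package.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List
from xml.sax.saxutils import escape


@dataclass
class InMemoryDocument:
    """Toy stand-in for the in-memory representation; the real IMBR is undocumented."""
    paragraphs: List[str] = field(default_factory=list)


Mapper = Callable[[InMemoryDocument], str]


def map_to_odf(doc: InMemoryDocument) -> str:
    """Map paragraphs to an ODF-style fragment (namespace declarations omitted)."""
    body = "".join(f"<text:p>{escape(p)}</text:p>" for p in doc.paragraphs)
    return f"<office:text>{body}</office:text>"


def map_to_plain_text(doc: InMemoryDocument) -> str:
    """A second, hypothetical target showing that the mapping step is swappable."""
    return "\n".join(doc.paragraphs)


MAPPERS: Dict[str, Mapper] = {"odf": map_to_odf, "plain": map_to_plain_text}


class SaveInterceptor:
    """Intercepts a save request and routes it to the configured mapper."""

    def __init__(self, target: str, mappers: Dict[str, Mapper]) -> None:
        if target not in mappers:
            raise KeyError(f"no mapper registered for {target!r}")
        self._mapper = mappers[target]

    def on_save(self, doc: InMemoryDocument) -> str:
        # The host application triggers the save; we only choose the output mapping.
        return self._mapper(doc)


if __name__ == "__main__":
    document = InMemoryDocument(["Quarterly report", "Costs & revenue"])
    print(SaveInterceptor("odf", MAPPERS).on_save(document))
```

Swapping `"odf"` for another registered key changes the output format without touching the document model, which is the design point of mapping from one intercepted structure to several possible targets.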

Conversion Fidelity & Interoperability ::

We fully believe that ODF version 1.0 provides daVinci with the flexibility we need to hit the same quality of fidelity of conversion of those billions of binary documents that EOOXML promises. Which is to say that ODF 1.0 has long offered Microsoft the same opportunity to convert everything to ODF and back. There is no technical reason for Microsoft not to have implemented ODF. And there is no technical reason for them to now ask that ISO/IEC consider a second universal file format specification as an International standard. But what's beyond the issue of conversion fidelity? Inter application and cross platform interoperability; the ability to transport and exchange documents across many different kinds of information domains without loss of fidelity or structural compromise. Interop is a tall order. Especially after years of living with application bound file formats that only the application and platform vendors can transport and exchange effectively. In their EOOXML pitch, Microsoft promises something called "Interoperability by design". Translated, this means that all Microsoft applications will be designed to work perfectly with EOOXML. Most likely we will also see MS applications able to handle the binary extensions of EOOXML that showed up in Excel 2007. This includes desktop, server and device systems written to .NET 3.0 and the Vista platform. To make certain this happens, Microsoft has provided us with a new version of VSTO 2005 where they drop support for MSXML and introduce support for EOOXML. They make it easy. The thing is that if you're a non-Microsoft application, most likely you won't be able to fully implement EOOXML. Definitely you won't have access to the binary InfoSet extensions of EOOXML. Leastways not without a price, and never if you're a competitor like Oracle, IBM or Sun. ODF interoperability is open and freely available to anyone wanting to implement ODF. Participation in the OASIS ODF TC specification process is open and affordable. 
There are no application or platform specific dependencies, or licensing restrictions, or patent encumbrances - legal risks holding anyone back. Universal file format interoperability is a given with ODF. Application interoperability is another matter. Especially existing applications that might have layout engines developed long before ODF became available. With new ODF applications this won't be a problem since they can develop directly to the specification. This is one of the reasons so much work is going into ODF 1.2, to accommodate the differences of traditional layout engines as they implement ODF. We don't have the power or authority of a Microsoft to rewrite every application to work perfectly with ODF. Nor do we have a similar command of the marketplace to force a user base of over 550 million desktops to upgrade to a Vista platform of Microsoft Office 2007 - VSTO -IE 7.0 -Exchange/SharePoint/Groove - MS SQL Server, MS Active Directory Server, etc. So instead, we have ODF 1.2 waiting in the wings. That's where our solution to universal application interop lies.

About the flexibility of ODF 1.0 - The Interop of ODF 1.2 ::

It is true that ODF has had, since February of 2003, an extremely flexible set of tags in the specification. They are called the <foreign element> and <alien attribute> tags, and were designed exactly to handle the billions of unspecified and conversion defiant binary object anomalies known to comprise years of Microsoft proprietary binary file format use. When you're mapping from an IMBR conversion process to ODF, you have to have something to put the unknowns in. You have to map to an existing tag. Since the very nature of these dark objects is that they are "unspecified" and previously unknown, they are also outside the range of ODF. Using these tags, daVinci can get perfect fidelity between the billions of binary documents and ODF. And get it right every time, with every Microsoft Office version from 1997 to 2007. But this "perfect fidelity" comes at a high cost of interoperability with other ODF ready applications. Simply put, the other ODF applications have no idea what to do with the daVinci <foreign elements>! A Microsoft Office with daVinci knows. But for all other ODF ready applications these dark objects are still a mystery. In many ways the <foreign element> tags are the equivalent of what EOOXML does with the same volumes of unspecified legacy tags. They're there, but no one excepting Microsoft Office Compatibility Pack enabled installations knows what to do with them. The same is true with daVinci ODF. Only daVinci knows what to do with these dark objects. In other words, from day one ODF has had the exact same means of wrapping in proper XML an unspecified binary object or processing instruction as that which EOOXML is now parading about as something absolutely necessary (and unique to EOOXML) for converting those billions of binary documents to XML. With ODF 1.2, daVinci gains the flexibility to map whatever dark objects it finds in ways that will dramatically improve interop with other ODF 1.2 ready applications. 
(see the <interop eXtensions> proposal submitted by Florian Reuter to the ODF Metadata SC). Using the new metadata model, daVinci can then proceed to fully describe everything known and intuited about the dark object. Keep in mind that daVinci has an inside view. What daVinci sees is the IMBR context and conversion structure that is missing from a binary file format as well as EOOXML's cryptic tags. This descriptive model will provide every other ODF 1.2 ready application a much better chance to handle and render the dark objects. With ODF 1.0, we were limited in how we describe for interop purposes these unspecified creatures. With ODF 1.2, daVinci can field these objects on the fly, and give other ODF 1.2 ready applications a fighting chance to properly render them. Sticking a binary object into an XML wrapper is just kicking the can forward. It's passing the problem onto someone else. Yes, it solves the momentary problem of an XML file format plugin running inside Microsoft Office (EOOXML and ODF). No problemo for those users. But it punts the problem of roundtrip interoperability with other ODF ready applications. They are left hanging. With the generic <interop eXtension> approach, and the metadata descriptive model, my guess is that ODF 1.2 ready applications will handle upwards of 98% of these problems instead of having to ignore the entire binary block. Over time of course, we will come to understand, specify properly, and map directly these binary objects. Years of reverse engineering have brought us to upwards of 85% conversion fidelity. Now we need to nail that remaining and highly elusive 15%. Uncompromising demands from Massachusetts and the EU have forced Microsoft to come out in the open with their proprietary XML. They are fighting tooth and nail to keep their application bound binary secrets secret. And with good reason. 
If we crack that last 15%, and do it in a way that provides users with a totally non-disruptive migration path to ODF, the monopoly will have been cracked open. Sometimes I wonder if the ISO/IEC JTC-1 members realize that they have it in their power to do what no government has thus far been able to do - stop the Microsoft monopoly from illegally leveraging their control into other markets, and restore open competition to technology marketplaces. The daVinci ODF 1.2 and ODF 1.0 plugin demonstrations will be made available to ISO/IEC members as positive and irrefutable proof that "it can be done". Hopefully we can get a video demonstration to walk them through daVinci so they can see for themselves. Hope this helps, ~ge~

Notes:

Microsoft joined the original OASIS Open Office XML effort in November 2002 (now OpenDocument or "ODF"). But they refused to participate or comment, instead quietly observing the work of the ODF Technical Committee for the next four years. Meanwhile, they began work on a proprietary XML file format designed specifically and solely to meet the "XML" needs of their Microsoft Office applications and emerging Vista platform of desktop, server and device systems. In 2004, Microsoft presented their proprietary XML effort to the European Union in response to a famous study known as the "Valoris Report". The report recommended that EU governments and organizations mandate information technology purchase requirements based on a far reaching but uncompromising infrastructure of SOA, Open Standards, and Open XML mandates. Valoris also recommended the development of a universal XML file format that was application and platform independent, able to service the portable document needs of an SOA infrastructure stretching over desktops, servers, devices and across the Internet. The universal file they envisioned was tagged "OpenDocument". When Massachusetts followed the EU with a clear mandate for Open Standards and open XML file formats that were recognized by Open Standards bodies, Microsoft formed the MS Ecma 45 workgroup with the objective of developing an open standard XML file format perfectly compatible with the existing Microsoft Office XML file formats. Meaning the end result was in the hands of the Ecma 45 workgroup before they even began work. All that needed to be done was some massive documentation of what in essence is a Microsoft Office binary dump into XML. Or, if you prefer, an XML encoding of Microsoft Office proprietary binary file formats.
