by Rick Jelliffe, O'Reilly Articles
Here are some XML metrics for a large document with almost 180,000 words, tables, lists, sidebars and some graphics. I chose a large document so that bootstrap effects would be minimized. I used the ODF v.1.0 specification, converting it from .SXW to .DOC and .ODT in Open Office 2.0, then converting the .DOC to .DOCX in Word 2007 beta. Then I used a COTS archiver to treat the ODT and DOCX files as ZIP archives, and extracted the XML files containing the basic text and markup: content.xml (ODF) and word/document.xml (MSOOX). I chose to start from the .SXW format because I didn't want to have any MS dependencies in the data, .DOC being proprietary. I also resaved the document to .DOC, re-opened it, re-exported it to .DOCX and extracted the word/document.xml file.
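The extraction step can also be scripted rather than done by hand in an archiver. The following is a minimal sketch in Python, not the procedure actually used for these numbers; the function name `extract_main_xml` and the lookup table are illustrative assumptions, and only the two part names come from the description above.

```python
import zipfile

# Main text-and-markup part inside each package, as named in the article.
MAIN_PART = {
    ".odt": "content.xml",
    ".docx": "word/document.xml",
}


def extract_main_xml(package, extension):
    """Return the raw bytes of the main XML part of an ODT or DOCX package.

    `package` is a file path or a binary file-like object; `extension`
    (".odt" or ".docx") selects which part name to read.
    This helper is hypothetical and exists only for illustration.
    """
    part_name = MAIN_PART[extension.lower()]
    # Both formats are ordinary ZIP archives, so zipfile can open them directly.
    with zipfile.ZipFile(package) as archive:
        return archive.read(part_name)
```

For example, `extract_main_xml("spec.odt", ".odt")` would return the bytes of content.xml, where `spec.odt` is a placeholder file name.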
Resaving data is a good trick when doing data conversion, because it removes extraneous information or structures from the source: the first .DOC is what Open Office thinks .DOC looks like, the second .DOC is how Microsoft does things. The numbers seem to support the interpretation that beta MSOOX may be quite a bit less complex than ODF 1.1 at this stage, at least in the sense of using fixed structures more, and simpler in the sense of using fewer elements and attributes. ODF is flatter and has a smaller file size, but seems to include more style headers than MSOOX does. The metrics indicate that the use of attributes may be significantly different between the two formats, which matters, for example, to people estimating data conversion work. On the application level, Open Office loads the ODT file much faster than the Word 2007 beta loads the DOCX file. I wouldn't be surprised if MSOOX were easier to convert from (because of its regularity, scale and low complexity) while ODF were easier to convert into (because of its richness and flexibility), once the initial hurdle of converting anything to or from either of them has been cleared.
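For readers who want to try a similar comparison on their own files, here is a rough sketch of how counts of this kind might be gathered. It is not the tool behind the numbers discussed here; the function name `xml_metrics` and the particular set of counts (elements, attributes, distinct names, maximum nesting depth) are assumptions chosen to match the properties mentioned above.

```python
import xml.etree.ElementTree as ET


def xml_metrics(xml_bytes):
    """Compute simple structural counts for one XML document.

    Hypothetical helper for illustration. Namespace declarations are not
    counted as attributes, because ElementTree does not expose them in
    `attrib`; comments and processing instructions are ignored.
    """
    root = ET.fromstring(xml_bytes)
    element_names = set()
    attribute_names = set()
    elements = 0
    attributes = 0
    max_depth = 0

    # Iterative walk with an explicit stack, so deeply nested documents
    # do not hit Python's recursion limit.
    stack = [(root, 1)]
    while stack:
        node, depth = stack.pop()
        elements += 1
        element_names.add(node.tag)
        attributes += len(node.attrib)
        attribute_names.update(node.attrib)
        max_depth = max(max_depth, depth)
        stack.extend((child, depth + 1) for child in node)

    return {
        "elements": elements,
        "attributes": attributes,
        "distinct_element_names": len(element_names),
        "distinct_attribute_names": len(attribute_names),
        "max_depth": max_depth,
    }
```

Running this over content.xml and word/document.xml and comparing the two dictionaries gives a first impression of relative flatness and attribute use, though it says nothing about load time or conversion effort on its own.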