Migration to XML_Docbook

Subject: Migration to XML_Docbook
From: Dan Emory <danemory -at- primenet -dot- com>
To: "Free Framers" <framers -at- omsys -dot- com>, "FrameSGML List" <FrameSGML -at- onelist -dot- com>, "TECHWR-L" <techwr-l -at- lists -dot- raycomm -dot- com>
Date: Wed, 17 May 2000 02:45:48 -0700

Below is a very preliminary analysis I was asked to perform for someone interested in converting unstructured FM docs to structured FM+SGML docs that conform to the Docbook DTD. I am posting it on the TECHWR, FrameSGML, and framers lists because I believe it may be of general interest. I invite comments, particularly if you disagree with anything stated herein.
==============================================
I have examined the FM document you sent. It appears to be consistently tagged. The paragraph and character tagging scheme is quite simple, and reflects a relatively small number of document object types (e.g., body text, bulleted list, datafile). The Docbook DTD defines approximately 120 different elements. My own opinion is that Docbook is the DTD from hell, and should be avoided at all costs unless you are being forced to use it.

1. CONVERSION TO STRUCTURED FM+SGML DOCUMENTS
Any automated method of converting unstructured FM documents to structured FM+SGML documents that conform to a DTD requires that the paragraph and character tags be unambiguously mappable to the applicable elements in the DTD. Furthermore, attribute values for elements in the converted docs cannot be assigned automatically (i.e., every value would be initialized to its DTD-specified default, if any).

FM+SGML has a built-in capability to convert unstructured docs to structured ones, using structure rule tables to map the various tagged document objects in the unstructured doc to the corresponding SGML elements. When there is a 1:1 relationship (as opposed to a 1 to many or many to 1) between each tagged object and a corresponding SGML element, structure rule tables can do a fairly good job. However, manual cleanup work is inevitable to make the converted document fully conformant to the DTD/EDD, and to apply the appropriate attribute values.
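
The 1:1 requirement can be checked before any conversion is attempted. The following is a minimal Python sketch of such a check, not anything built into FM+SGML. The tag names and the choice of target elements in EXAMPLE_PAIRS are assumptions made up for illustration; a real list would come from your own paragraph/character catalogs and the DTD.

```python
from collections import defaultdict


def classify_mappings(pairs):
    """Split (tag, element) pairs into 1:1, 1-to-many, and many-to-1 groups.

    pairs: iterable of (paragraph_or_character_tag, element_name) tuples.
    """
    tag_to_elems = defaultdict(set)
    elem_to_tags = defaultdict(set)
    for tag, elem in pairs:
        tag_to_elems[tag].add(elem)
        elem_to_tags[elem].add(tag)

    # One tag with several candidate elements: cannot be mapped automatically.
    one_to_many = {t: sorted(e) for t, e in tag_to_elems.items() if len(e) > 1}
    # Several tags collapsing into one element: mappable, but information is lost.
    many_to_one = {e: sorted(t) for e, t in elem_to_tags.items() if len(t) > 1}
    # A tag is 1:1 only if it has one element and that element has one tag.
    one_to_one = {}
    for tag, elems in tag_to_elems.items():
        if len(elems) == 1:
            (elem,) = elems
            if len(elem_to_tags[elem]) == 1:
                one_to_one[tag] = elem
    return one_to_one, one_to_many, many_to_one


# Hypothetical tag names and element choices, for illustration only.
EXAMPLE_PAIRS = [
    ("Body", "para"),
    ("Bullet", "listitem"),
    ("DataFile", "filename"),
    ("Note", "note"),
    ("Note", "caution"),     # one tag, two candidate elements
    ("Heading1", "title"),
    ("Heading2", "title"),   # two tags, one element
]
```

Calling classify_mappings(EXAMPLE_PAIRS) returns three dictionaries. Anything that lands in the second or third is a candidate for the manual cleanup described above.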

I conclude that your documents probably do not fit well with the above conversion requirements, particularly for conversion to a DTD/EDD as complex as Docbook. However, a more thorough analysis might show otherwise, particularly if you decide to develop your own DTD/EDD whose structure closely resembles that of your existing documents.

There is one additional requirement that must be met for unstructured to structured conversions to be possible: The entire FM document must have a single text flow.

Obviously, once you've converted an unstructured FM document to a structured FM+SGML one, you never again want to revert to the unstructured one for editing or anything else. After conversion, you should discard the original (first verifying, of course, that everything was properly converted).

2. VERSION CONTROL
You mention keeping the content of these documents (in .txt or .mif format) in a CVS. Clearly, storing .txt or .mif is not the answer. Instead, you should export the documents from FM+SGML to XML and store that.

XML has many new features (including Unicode) that make it superior to SGML (and certainly cosmically better than ASCII text or MIF) for database storage. Storage in this form has the added advantage of allowing you to maintain revision/version control at any desired level of granularity, because the proper kind of database repository can parse the document into its individual components (i.e., elements and external entities (e.g., graphics)), maintain revision/version information on each component, and retrieve any desired portion of any desired version.
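
To illustrate what component-level revision tracking means, here is a minimal Python sketch. It is not how any particular repository product works; it simply assumes that each component of interest carries an "id" attribute (an assumption about your DTD), hashes each such element together with its descendants, and reports which components differ between two versions.

```python
import hashlib
import xml.etree.ElementTree as ET


def component_digests(xml_text, id_attr="id"):
    """Return {component id: SHA-1 digest} for every element carrying an ID."""
    root = ET.fromstring(xml_text)
    digests = {}
    for elem in root.iter():
        comp_id = elem.get(id_attr)
        if comp_id is None:
            continue
        # Drop the trailing text temporarily so that text following the
        # element does not count as part of the component.
        tail, elem.tail = elem.tail, None
        data = ET.tostring(elem, encoding="utf-8")
        elem.tail = tail
        digests[comp_id] = hashlib.sha1(data).hexdigest()
    return digests


def changed_components(old_xml, new_xml, id_attr="id"):
    """List IDs that were added, removed, or changed between two versions."""
    old = component_digests(old_xml, id_attr)
    new = component_digests(new_xml, id_attr)
    return sorted(k for k in old.keys() | new.keys() if old.get(k) != new.get(k))


# Hypothetical two-section chapter; only the second section is edited.
V1 = ('<chapter id="ch1"><sect1 id="s1"><para>Alpha</para></sect1>'
      '<sect1 id="s2"><para>Beta</para></sect1></chapter>')
V2 = V1.replace("Beta", "Gamma")
```

With these two versions, changed_components(V1, V2) reports the edited section and the chapter that contains it, but not the untouched section. A real repository would also track external entities such as graphics, which this sketch ignores.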

A CVS/data repository that stores XML can become the sole source of controlled documents for an entire enterprise. Information is retrieved from the database by human and non-human queries. Middleware (e.g., Omnimark) is used to process the information extracted by these queries to match the requirements specified by the users. XSL style sheets (also part of the XML standard) can be created by the middleware to format the information when it is viewed in an XML-aware browser.

3. ROUND-TRIPPING BETWEEN THE CVS AND FM+SGML
Ideally, you would originate, revise, and edit your structured documents in the WYSIWYG environment of FM+SGML, export them as XML for storage in the database repository, and check the documents (or any portion thereof) directly out of the database into FM+SGML for incorporating changes, as well as for printing them or converting them to PDF or other formats. However, XML round-tripping is not possible because FM+SGML (including the new 6.0 version) can export XML but cannot import it. Consequently, if you export your documents as XML for storage in the database, you'll have to use a middleware product like OmniMark to convert the XML document instances to SGML before they can be imported into FM+SGML. This conversion from XML to SGML also requires that Unicode characters with character codes above 127 (as well as any other non-English characters) be converted to their equivalent ISO character set entity references, since FM+SGML cannot process Unicode input.
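
The character conversion step can be sketched in a few lines of Python. This is only an illustration of the idea, not a substitute for a tool like OmniMark. It uses the entity names in Python's html.entities table as a stand-in for the ISO entity sets (many of those names were taken from the ISO sets, but that correspondence is an assumption you should verify against the entity sets your DTD actually declares). Characters with no known name fall back to a numeric character reference, which may or may not be acceptable to your import application.

```python
from html.entities import codepoint2name


def to_entity_refs(text):
    """Replace every character above code 127 with an entity reference.

    Apply this to character data only, not to markup.
    """
    out = []
    for ch in text:
        code = ord(ch)
        if code < 128:
            out.append(ch)                       # plain ASCII passes through
        elif code in codepoint2name:
            out.append("&" + codepoint2name[code] + ";")   # named entity
        else:
            out.append("&#%d;" % code)           # fallback: numeric reference
    return "".join(out)
```

For example, to_entity_refs("caf\u00e9") yields "caf&eacute;", while a character that has no entry in the table comes back as a numeric reference that you would have to review by hand.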

It is extremely unfortunate that FM+SGML (including the new version 6.0) does not implement Unicode. If Unicode had been fully implemented, it would have been possible to use multi-language Unicode fonts with FM+SGML, which would have greatly facilitated language translations, including the intermixing of two or more languages in the same document. The intermixed languages would have been fully preserved on export to, or import from, XML.

4. LINK PROBLEMS
Another problem is links (i.e., cross-references and hypertext links). FrameMaker implements cross-reference links using ID and IDREF attributes which conform to the SGML standard. This is OK when all such links are internal to the exported SGML document instance, but external cross-references created in FM+SGML do not produce links that work when the document is exported to SGML, because FM+SGML, on export, cannot produce an IDREF attribute value that includes the location of the external file (this is a limitation of SGML). To make it worse, neither the internal nor the external cross-references work if the document is exported to XML, because links in XML are implemented differently, as specified in the XLink and XPointer portions of the XML standard. You could create XML-conformant equivalents of the ID and IDREF attributes in the FM+SGML EDD (and the corresponding DTD). However, these attribute values, unlike FM cross-references, will have to be manually entered for the elements at each end of each link, and the links will not work in FM+SGML.
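
One way to find out how many links in an exported instance will be affected is to list every IDREF value that has no matching ID in the same document. The Python sketch below does that. The attribute names "id" and "linkend" follow Docbook's convention for cross-references; if your EDD uses different names, pass them in. The sample document is made up for illustration.

```python
import xml.etree.ElementTree as ET


def unresolved_links(xml_text, id_attr="id", ref_attr="linkend"):
    """Return the sorted IDREF values that match no ID in this document."""
    root = ET.fromstring(xml_text)
    # Collect every ID declared anywhere in the instance.
    ids = {e.get(id_attr) for e in root.iter() if e.get(id_attr) is not None}
    # Collect every reference whose target is not among those IDs.
    missing = {
        e.get(ref_attr)
        for e in root.iter()
        if e.get(ref_attr) is not None and e.get(ref_attr) not in ids
    }
    return sorted(missing)


# Hypothetical instance: one internal link and one link into another file.
SAMPLE = (
    '<book>'
    '<chapter id="intro"><para>See <xref linkend="setup"/> and '
    '<xref linkend="other-book-ch3"/>.</para></chapter>'
    '<chapter id="setup"><para>Text</para></chapter>'
    '</book>'
)
```

Here unresolved_links(SAMPLE) returns only the reference that points outside the document. The internal reference resolves, which matches the behavior described above for SGML export.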

5. FORMATTING
You also mentioned that it would be nice to be able to preserve the "look and feel" of the existing unstructured documents after they've been converted to structured documents. This is where FM+SGML really shines. All of the formatting specifications are defined in the EDD and its companion template. Consequently, you can make the converted FM+SGML documents closely resemble the formatting of the current documents. Also, when you import an SGML document instance (including one converted from XML, as described in section 3) into FM+SGML, the formatting specified in the EDD is applied.

When you export an XML document instance, you can also produce a Cascading Style Sheet (CSS) that is derived from the formatting specifications in the EDD and its companion template. Thus, if you open an exported XML document instance with a CSS in an XML-aware browser such as IE5, the formatting (but not necessarily the layout) of the original document will be replicated.

CONCLUSION
As you can see, conversion to structured FM+SGML documents is not a trivial undertaking, and the full utilization of all the benefits that can be derived therefrom is made difficult by some of FM+SGML's current limitations. The initial investment is high, but if your operation is large enough, the savings possible in areas such as author productivity, document quality assurance, revision control, information reuse, and information repurposing will pay back those costs many times over.




====================
| Nullius in Verba |
====================
Dan Emory, Dan Emory & Associates
FrameMaker/FrameMaker+SGML Document Design & Database Publishing
Voice/Fax: 949-722-8971 E-Mail: danemory -at- primenet -dot- com
10044 Adams Ave. #208, Huntington Beach, CA 92646
---Subscribe to the "Free Framers" list by sending a message to
majordomo -at- omsys -dot- com with "subscribe framers" (no quotes) in the body.





