Incorporating XML in Documentation

Subject: Incorporating XML in Documentation
From: Jason Willebeek-LeMair <jlemair -at- cisco -dot- com>
To: "TECHWR-L" <techwr-l -at- lists -dot- raycomm -dot- com>
Date: Tue, 17 Apr 2001 17:29:16 -0500

Names have been changed to protect the ignorant.

This describes the XML project of a single documentation
team and does not represent any corporate projects.

I deny everything and accept responsibility for nothing.


This describes an XML solution for a single product. It is not
an ideal solution by any means, but it does demonstrate a low-cost
alternative to expensive document management systems for small
groups or projects. The system described here WILL NOT scale
to an enterprise-wide solution.

It is also a good method for a proof-of-concept or learning
implementation without shelling out a lot of $$$.

XML Implementation

Phase I: Defining the Markup and Selecting the Tools

We chose to create our own DTD rather than use an existing one.

We chose DTD rather than schema because 1) schema had
not yet been finalized, 2) there were several competing
schema specs (TREX, RELAX, W3C XML Schema, MS Schema), and
3) most importantly, there was not (and still is not)
widespread support for schema.

We went with home brew rather than an existing DTD (such as
DocBook) because our information set was already pretty
well structured and because the existing DTDs did not quite
meet our needs (they either offered too much or too little).
Besides, it was fun.

<side.note>Six months after we went live, IBM came out with
their DITA architecture, which is remarkably similar to ours
at a high level. So, we like to think that we are as smart
as IBM.</side.note>

Some tips to remember when creating your own markup:

* Think structure, not format. Formatting is handled by
another process. This has been the most difficult part
of our system for our writers to grasp. They keep wanting
to see how things will look, which is impossible (after
all, our hardcopy style is much different from our online
style).

* Think semantics. What role does the markup play? Is it
a menu, a command, a button, a step, an item in a list, etc.

* Think about tag names. You may be tempted to abbreviate
tag names, and it may be good to do so. On the other hand,
if you are anticipating getting people up to speed quickly,
you may not want to. For example, which is more explanatory:
tl or tasklist? Also think of the authoring environment;
most XML authoring tools allow you to pick tags like you
pick paragraph styles in Frame or Word, so having longer names
does not burden the writer. On the other hand, if you are
authoring in NotePad or vi, you may want to abbreviate the
names to save wear and tear on the writer's fingers and

Our XML information architecture has three layers.
The top layer has four DTDs: Book, HTML Help, WinHelp,
and Workbook (we'll get to workbook later). They each
correspond to the type of deliverable. They also organize
our topics slightly differently. For example, the
book DTD has chapters and only allows nesting of topics
within the chapters to go four levels deep. The HTML Help DTD
contains help project information and allows infinite
nesting of our topics (as does the help).

These four DTDs rest on top of the real meat of the
architecture -- the topics. We defined a topic as a heading
plus the text after that heading. This is basically our
smallest reusable chunk of information, aside from some
boilerplate text and warnings. We have several types of
topics: Procedurebox, Conceptbox, Taskbox, Checklist, and
Fieldbox. Each topic has a unique content model. For example,
a procedurebox contains a Title, an Intro, and a Procedure. The
procedure contains one or more steps, which in turn can have
actions, notes, warnings, cautions, results, etc.
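
To make the content model concrete, here is a minimal Python sketch
that checks a procedurebox against the shape described above. The
lowercase element names (procedurebox, title, intro, procedure, step,
action, result) and the sample text are assumptions for illustration;
the actual DTD is not shown in this post, and a validating parser
working from the DTD would normally do this job.

```python
import xml.etree.ElementTree as ET

# Hypothetical element names and content; the real DTD is not shown here.
SAMPLE = """
<procedurebox>
  <title>Adding a user</title>
  <intro>Use this procedure to add a user.</intro>
  <procedure>
    <step><action>Open the Users menu.</action></step>
    <step><action>Click Add.</action><result>The dialog opens.</result></step>
  </procedure>
</procedurebox>
"""

def check_procedurebox(elem):
    """Return a list of problems; an empty list means the shape is as expected."""
    problems = []
    child_names = [child.tag for child in elem]
    if child_names != ["title", "intro", "procedure"]:
        problems.append("expected title, intro, procedure but found %s" % child_names)
    procedure = elem.find("procedure")
    if procedure is not None and not procedure.findall("step"):
        problems.append("procedure needs at least one step")
    return problems

box = ET.fromstring(SAMPLE.strip())
print(check_procedurebox(box))
```

Each topic type would get its own check of this kind, which is the
sense in which each topic has a unique content model.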

Underlying the topics is our common markup. Typical stuff,
like paragraphs, lists, inline markup, etc.

With this architecture, we can quickly create new top-level
forms (JavaHelp, man pages, etc.) for new types of
deliverables or insert a new information type (such as a
glossary) into the architecture without too much fuss (i.e.
new work and re-work).

Tools were another matter. Open source XML processors are
pretty abundant. This was important, because this started
out as a side project with no budget. We used Saxon and
XALAN at the beginning, because they were free, good, and
compliant. We have added MSXML to our bag of
tools also, since its latest release is compliant with the
specs, fast, and has a command-line utility (important for
us--I will get to that in the output section). Since our budget
was US$0.00, NotePad was our editor. You really do not need more
than that to create a DTD, XML, or XSLT file, but I eventually
upgraded to a US$25.00 copy of TextPad.

Phase II: Testing

Naturally, we tested the markup and the process of converting
our HTML source to XML. Found a lot of bugs and overlooked
issues. Always test before you convert.

Total Project Cost So Far: US$25.00 (Don't worry, it gets more
expensive later)

Phase III: Conversion of Existing Source

Since all of our source was in pure HTML, we used HTMLTidy
to convert the
existing information source to XML. Then, where possible,
we used an XSLT transform to convert it to an approximation
of our markup.

This is an important point -- we could not transform directly
to our markup because our XML is semantically richer than our
HTML (despite the liberal use of Class attributes). For
example, items marked as <b></b> in the HTML may be converted
to <menu></menu> or <field></field>. These items required manual
cleanup.

I suppose we could have been clever and written a script that
examined the contents of the tag and applied the appropriate
XML tag, but we were not clever at that time.
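
For what it is worth, the kind of script described above could look
something like this minimal Python sketch. It is not something we
ran. The lookup tables (MENU_NAMES, FIELD_NAMES), their entries, and
the function name retag_bold are all hypothetical; a real project
would have to build such lists from the product's user interface.

```python
import xml.etree.ElementTree as ET

# Hypothetical lookup tables of known user-interface terms.
MENU_NAMES = {"File", "Edit", "View"}
FIELD_NAMES = {"User Name", "Password"}

def retag_bold(root):
    """Rename <b> elements in place when their text matches a known term.

    Returns the text of every <b> that could not be classified.
    """
    unresolved = []
    for elem in list(root.iter("b")):
        text = (elem.text or "").strip()
        if text in MENU_NAMES:
            elem.tag = "menu"
        elif text in FIELD_NAMES:
            elem.tag = "field"
        else:
            unresolved.append(text)  # stays <b> for manual cleanup
    return unresolved

para = ET.fromstring(
    "<p>From the <b>File</b> menu, type a <b>Password</b> "
    "and press <b>Enter</b>.</p>"
)
leftover = retag_bold(para)
print(ET.tostring(para, encoding="unicode"))
print(leftover)
```

Anything the tables do not recognize is left alone and reported, so
a script like this would shrink the manual work rather than replace it.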

For the cleanup work, we assigned all of the writers a set of
files to work on and gave them a month. We used a commercial
authoring tool (XMetaL was our tool of choice, but there are
other commercial editors available that are arguably just as
good or even better--before selecting an authoring tool, ask
for demo copies and give them all a try). XMetaL gave us on-
the-fly tag choosing and validation, so that the authors could
know when their files were valid.

Total Project Cost So Far: In the 4-digit range (buying 11 copies
of the authoring tool).

Phase IV: Authoring

Authoring in a flat-file environment, rather than a documentation
database-driven one, posed some unique challenges due to the
nature of XML.

For example, we do not have a doc database that can
automatically chunk a chapter into its component information
types. We can configure our authoring tool to do this, but
have not had time. So, instead, we create the chunks and then
collect them in the higher-level document.

The problem with this approach is that the chunks, or XML
fragments, cannot have a document type declaration in them
(see production 78 of the XML Specification). So, you cannot
edit them directly in the authoring tool and have the nifty
dynamic tag chooser (which gets its information from the DTD).

To work around this problem, we created templates for all of
the information types and an artificial top-level document
type called a WorkBook. When the author needs to create new
information types, they use the templates to create the new
files. Then, they add the new files to the WorkBook (as
referenced, external entities) which contains a document type
declaration (and provides the authoring tool the DTD from
which to extract the tag-choosing information). They can then
open the fragments from within the workbook file and edit away.

It sounds more complicated than it is in practice.
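
As an illustration, the shape of a WorkBook file can be sketched with
a few lines of Python that write one out. The root element name
(workbook), the DTD file name, the entity names, and the fragment
paths are assumptions made up for this example; only the general
pattern (a document type declaration plus external entities that are
referenced in the body) comes from the description above.

```python
def build_workbook(dtd_path, fragments):
    """Return WorkBook XML that pulls each fragment in as an external entity.

    fragments maps an entity name to the path of the fragment file.
    """
    declarations = "\n".join(
        '  <!ENTITY %s SYSTEM "%s">' % (name, path)
        for name, path in fragments.items()
    )
    references = "\n".join("  &%s;" % name for name in fragments)
    return (
        '<?xml version="1.0"?>\n'
        '<!DOCTYPE workbook SYSTEM "%s" [\n%s\n]>\n'
        "<workbook>\n%s\n</workbook>\n" % (dtd_path, declarations, references)
    )

# Hypothetical DTD and fragment file names.
print(build_workbook("workbook.dtd", {
    "add.user": "procedures/add_user.xml",
    "user.fields": "fields/user_fields.xml",
}))
```

The fragment files themselves stay free of any document type
declaration; only the WorkBook carries one.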

An additional benefit to the WorkBook is that the writers now
have a logical grouping of XML fragments. They can also swap
WorkBook files for editing or information sharing.

Total Project Cost So Far: Quadruple Digits (for tools) + Time

Phase V: Output

This is perhaps the most asked-about part of XML authoring.
How do I get a book/help system/web page/whatever out of it
once I have authored it?

All of our output is generated using XSLT. We have XSLT files
for: HTML and HTML Help generation, RTF for What's This help
generation, and MIF for FrameMaker (hardcopy) generation.

The Help (HTML and What's This) are kicked off automatically from
a build script (which is why command line support was important when
choosing the tools). Hardcopy generation is currently manual.

As an example, our main HTML Help system is a single XML file with
all of the help topics included as references to the external
entities. The build script first parses that file and creates a help
project file. Then it parses it and generates the TOC. Then,
it parses the file a third time and generates all of the HTML files
for the help system. Finally, the build script kicks off HTML Help
Workshop and generates the CHM file, nicely plopping it in the
appropriate build directory.
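
The three passes can be pictured with the following Python sketch.
Note that it swaps in Python for the XSLT transforms the build
actually uses, and that the element names (help, topic, title), the
id attribute, and the .htm naming scheme are assumptions for
illustration only. The final compile step with HTML Help Workshop is
not shown.

```python
import xml.etree.ElementTree as ET

# Hypothetical master file, shown here after entity expansion.
MASTER = """
<help>
  <topic id="intro"><title>Introduction</title>
    <topic id="install"><title>Installing</title></topic>
  </topic>
  <topic id="users"><title>Managing Users</title></topic>
</help>
"""

def project_files(root):
    """Pass 1: one output file name per topic, for the help project file."""
    return [t.get("id") + ".htm" for t in root.iter("topic")]

def toc_lines(elem, depth=0):
    """Pass 2: an indented table of contents that mirrors the topic nesting."""
    lines = []
    for topic in elem.findall("topic"):
        lines.append("  " * depth + topic.findtext("title"))
        lines.extend(toc_lines(topic, depth + 1))
    return lines

def html_pages(root):
    """Pass 3: a bare-bones HTML page per topic, keyed by file name."""
    return {
        t.get("id") + ".htm":
            "<html><body><h1>%s</h1></body></html>" % t.findtext("title")
        for t in root.iter("topic")
    }

root = ET.fromstring(MASTER.strip())
print(project_files(root))
print("\n".join(toc_lines(root)))
print(sorted(html_pages(root)))
```

The point of the sketch is that a single source file drives every
artifact, so the project file, the TOC, and the pages cannot drift
out of step with one another.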

Similarly, the build script picks up the WinHelp XML file (based on
the WinHelp DTD that we defined) and creates the What's This help project
file and RTF file, generates the help, and plops it in the appropriate
build directory.

For hardcopy, we manually kick off the build because we do not rev those
as often.

Now, you may be thinking that this is nothing that FrameMaker, conditional
text, and WWP can't handle. And you would be right if you think of it
as a book to help process. But our HTML Help contains more information
than our books -- our books are just targeted subsets of the Help.
Some of that information may be reused across the books (yes, books, it is
a rather large product) or reused across the HTML Help and What's This help.
Being in XML makes it easy to reuse that information.

We can also reorganize our books in a matter of minutes simply by
rearranging, adding, or deleting entities. We can quickly create a new
book about a specific subject by collecting and arranging those
entities in an XML document, thus targeting a particular user or feature
of our application. And finally, the ultimate goal is to allow the user to
create their own books, either by on-the-fly construction based on the
features they ordered or by selecting topics from a web form.

There are countless other benefits, harped on ad nauseam on the web.

Total Project Cost So Far: Quadruple Digits (for tools) + Time

So, you can create an XML solution without an expensive document management
database or workflow utility. But there are tradeoffs, such as
scalability. However, for testing the feasibility of such a system, it is
a good compromise between functionality and cost.

XSL-FO -- A Look Towards the Future

Although in the planning stages, we are thinking about using
FO technology to bypass FrameMaker and go directly to
hardcopy (PDF or PS).

Our actual implementation will depend upon what we find
during our trials. We will probably use FOP or XEP during
our development. FOP is free and XEP offers
a free trial.

Recommended Reading

These books are listed in order from basic to advanced.

The XML Handbook, Third Edition
Charles Goldfarb and Paul Prescod

For those of you who are new to XML, this book provides an excellent
introduction to the vocabulary, concepts, and markup. Also includes
information about XSLT and real-life scenarios where XML is used in
B2B, publication, and applications.

Structuring XML Documents
David Megginson

Provides information about DTD syntax and includes 5 industry DTDs to
learn from. A must for those who are going to write their own DTD
or modify an existing one to meet their needs.

XML: The Annotated Specification
Bob DuCharme

An invaluable resource once you get past the basics.

XSLT: Programmer's Reference
Michael Kay

Once you master the basics of XSLT, you will need something heavier.
This is one of the best XSLT references that I have found. It is
full of examples.

Standard Weasel Disclaimer
All opinions and plans are strictly my own and in no
way represent the opinions or plans of my company.

AKA: I may be full of s**t.

