SGML --> HTML --> XML (long)
Simon North <north -at- SYNOPSYS -dot- COM>
Tue, 20 May 1997 15:53:31 +0000
Apologies if this is a repeat posting, my mail gateway seems to be
misbehaving and I keep getting bounces locally. At the risk of
annoying the members of this illustrious list, I shall try again.
This is an (edited) copy of a quick article I dashed off for my
documentation group about XML at the recent SGML Europe '97
conference in Barcelona, Spain.
SGML --> HTML --> XML
I have just got back from the SGML Europe '97 conference in
Barcelona (don't be envious; it rained nearly the whole time).
I won't take up your time with a detailed report of everything that
was covered but the announcements, snippets of information and other
newsbytes that I've assembled about XML make (I think) for an
interesting piece of news.
Please bear with me; this is a long story.
Channel Definition Format (CDF)
Despite its success, everyone has been slowly coming to agree
that HTML has about reached its limits. We've seen HTML 1.0,
HTML+, HTML 2.0, and now two releases of HTML 3.2 (with a version 4.0
rumored to be in the works) and we've had both Netscape's and
Microsoft's extensions. A proprietary solution had failed, and the
'struggle' between Microsoft and Netscape was even beginning to
fragment the market.
Then, on March 21, 1997, Microsoft, amidst a flurry of almost
total silence, announced publication of a new standard called
Channel Definition Format (CDF).
Essentially, CDF lays the foundation for the so-called "push"
technology by which some of the leading WWW content providers
(among them Microsoft, AOL, Ncompass and Pointcast, to name just
a few) will be able to use the Internet to deliver "information"
directly to your desktop without you having to explicitly request it
(as we do now when we "browse" a web page).
Normally this could have been written off as just evidence of
the industry's sorry lack of imagination in formulating a
business model for earning money via the Internet, and just an
attempt to transform it into a business model they *do*
understand - namely television. However, one small aside in
Microsoft's announcement mentions that CDF is based on XML ...
Big deal? Yes. An event possibly as revolutionary as the
publication of HTML itself ... and the WWW would not have become
what it is without HTML.
Extensible Markup Language (XML)
So. What is XML? Well, XML is a World Wide Web Consortium (W3C
- one of the closest things that the Internet has to a standards
body) working draft for a "slimmed down" version of SGML for use
on the Web (see
These are two of the most easily readable drafts that have come
out of the W3C, but for those with little time or inclination to
tackle the working papers themselves I shall try to highlight
some of the most important features and bring out some of the
less obvious points based on other work that I know is in
XML has been called DTD-less SGML and, in a way, it is - but it's
a very great deal more. First off - and make no mistake about this -
XML is fully compliant SGML (which HTML no longer is). In fact, there
are two 'flavours' of XML called "well-formed" and "valid".
- Well-formed XML does not need a DTD. Basically, you can
use whatever tags you want and tag what you want. You can
even use HTML as long as you follow a few (3) basic
- Valid XML does have a DTD. This is not a "pure" SGML DTD
since there are about 50 restrictions that must be obeyed,
and the syntax of an XML DTD is slightly different to
'pure' SGML. However, the effect of XML has already been so
radical that the SGML (and HyTime) standards are being
rewritten by the ISO Work Group to accommodate the changes
that XML requires to make it fully compliant with these
The fact that XML *is* SGML means that all SGML tools will be
able to work with valid XML. Any text editor, word processor or
desktop publishing package will be able to work with well-formed
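To make the well-formed/valid distinction above a little more concrete, here is a minimal sketch in Python (my choice of language for illustration; nothing in the drafts prescribes one). It uses the standard library's xml.etree.ElementTree parser, which is non-validating: it checks well-formedness only and does not validate a document against a DTD. The function name is_well_formed and the sample tags are made up for this example.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text: str) -> bool:
    """Return True if `text` parses as well-formed XML (no DTD involved)."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True


# Made-up tags: well-formed XML lets you invent your own element names,
# as long as every start tag is closed and elements nest properly.
good = "<manual><chapter id='c1'><title>Setup</title></chapter></manual>"

# Overlapping elements: <title> is closed after <chapter>, so this fails.
bad = "<manual><chapter><title>Setup</chapter></title></manual>"

print(is_well_formed(good))
print(is_well_formed(bad))
```

Checking validity would additionally need a validating parser and the DTD itself, which this sketch deliberately leaves out.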
The SGML community has looked down its nose at HTML for some
years now, regretting that such an obvious "hack" should be so
successful while SGML is so much better. Be that as it may, SGML
is an attempt to be all things to all people; a solution so
generic that it became a victim of its own strength. SGML is
complex, bulky, difficult, and expensive. For many solutions it
is the software equivalent of attempting to crack an egg with a
nuclear bomb. If it hadn't been for massive government subsidies
(Canada) or academic interest (France and the UK), SGML would
probably never have made it out of the research lab.
XML discards all of the more sophisticated features of SGML that
made the creation of SGML software so complex and costly (it was a
design goal that a graduate CS student should be able to write an XML
parser in a week -- they didn't quite meet the goal, but it was
close). Many, if not all, of the features that are 'sacrificed' were
not needed in applications such as ours (technical documentation) and
will probably not be missed.
In turn, XML extends the capabilities of SGML with:
- support of URLs as links
- HyTime linking (one-to-many, many-to-many, bi-directional)
- EXTERNAL link lists! Finally!!!! You can link from and to
READ-ONLY material. You can use a database to manage and
control the links.
- extended TEI linking (this is extremely sophisticated,
allowing you to link to parents and children of elements,
and to ranges of elements)
- makes it more suitable for database storage
- makes it more open to simple (batch) processing
and XML extends HTML to give, among other things:
- complete extensibility
- validatable content
- structure and hierarchy
- conditional content
- text entities (both internal and external)
- external entities
- (automatic) redirection
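The idea of external link lists mentioned above (links held outside the documents, so that even read-only material can be linked from and to) can be sketched as a small data structure. This is only an illustration of the concept in Python, not the syntax of the linking draft; the names Anchor, LinkBase, document and element_id are all hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True, order=True)
class Anchor:
    """A location in a document; both field names are illustrative."""
    document: str    # URL or path of a (possibly read-only) document
    element_id: str  # hypothetical identifier of the linked element


class LinkBase:
    """Links kept outside the documents they connect."""

    def __init__(self) -> None:
        self._forward = defaultdict(set)   # source -> targets
        self._backward = defaultdict(set)  # target -> sources

    def add(self, source: Anchor, *targets: Anchor) -> None:
        """Register a one-to-many link; call repeatedly for many-to-many."""
        for target in targets:
            self._forward[source].add(target)
            self._backward[target].add(source)

    def targets(self, source: Anchor) -> list:
        """Follow a link forwards."""
        return sorted(self._forward.get(source, ()))

    def sources(self, target: Anchor) -> list:
        """Follow a link backwards (the bi-directional case)."""
        return sorted(self._backward.get(target, ()))
```

Because the link table lives apart from the documents, neither the source nor the target file has to be writable, and the table itself could just as well sit in a database.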
But is this the end of HTML? No. HTML still does a good job. HTML
will continue to be the preferred markup language for visual presentation
where ease of use is important. SGML too will still have a place as
the preferred markup for high-end purposes. XML will fill the "middle
ground" where strength and reliability are at a premium.
Is it for Real?
Only the market and the W3C can really decide. However, despite
a lot of hype (and more than a few political questions and secret
agendas), there are enough signs to suggest that we should take
this very seriously:
- Jean Paoli (an SGML expert who used to work for a firm called
Grif) is now a member of Microsoft's Internet Explorer
- On March 13, 1997, the Graphic Communications Association
(GCA), at Microsoft's request, asked the SGML Forum of
New York to create a Technical Advisory Group to "provide
input on what features and capabilities would be desirable
for an XML engine in Internet Explorer 5.0".
- Various SGML experts have been recent visitors to
Netscape's campus in Mountain View, and to Microsoft's site
- Both Netscape and Microsoft have pledged support for XML in
release 5 of their browsers.
- The SGML conference in Washington in December 1997 has
been renamed to "XML/SGML '97".
- Jon Bosak of SunSoft has announced that Sun intends to
publish all of its Answerbook online documentation using
XML. (They use a dedicated docs server - using
DynaWeb - which will give element based access control to
the information ... an experimental version can already be
seen at http://docs.sun.com/).
- All of the SGML vendors have announced the (intended)
release of XML authoring and development tools.
- Since XML uses Unicode (instead of ASCII), the Asian market
seems to be desperately interested (Fujitsu are said to be
- The SGML and HyTime ISO standards are being updated
(the SGML standard was up for review anyway, and work
on the HyTime Technical Corrigendum was also pretty
advanced) to make some of XML's "tweaks" to the standards
properly legal. [For the SGML-aware: the "SGML Extended
Facilities" amendment will incorporate Annex A of the 10744
standard, a new Annex J will be added to extend character set
handling, and Annexes K & L, "Web SGML Adaptations", are expected to be
approved (says Charles Goldfarb) in December this year. I have 2
pages of notes on what these adaptations are, much too long for
but the changes are almost as exciting as XML itself --- if anyone
really wants to know, I will write up my notes ... ]
When will all this happen?
XML is not (yet) a standard. XML is not even finished (yet).
There will (probably) be 4 parts to the standard:
- Part 1, the syntax, has been published as a draft.
- Part 2, Linking, has been published as a draft.
- Part 3, Styles, will be published as a draft in June (it is
not yet clear whether DSSSL(-O) or CSS will be adopted).
As a personal plea, I do hope it will be DSSSL since this will
give us the transformation tools we need to be able to generate
things like TOCs, LOFs and indexes.
- Part 4, Chunking, has been delegated to the SGML Open
Another aspect, the Document Object Model (which will provide an API
for the manipulation of document 'chunks') will be published in time
for the XML/SGML '97 conference in December.
The complete XML standard is due to be finished by the Summer of 1998
(the working group drafting the documents is well aware that they
*must* complete this quickly or the market will find a different
What does this mean for technical writers?
Well, it means that we can seriously consider authoring technical
documentation in XML instead of 'full' SGML, while retaining all the
features that we wanted from SGML (simple post-processing, multiple
purposes/formats/media) without having to climb the steep
learning curve of SGML and all the overhead and cost that it normally
It means that we can continue (or even accelerate) our migration to
SGML knowing that we will be able to move to Intranet/Internet
publication without any extra work/conversion/ ....
We will be able to publish our documentation using either SGML
or XML on the Web (Intranet or Internet) without any intermediate
conversion. These web documents will support the automatic generation
of tables of contents, indexes, glossaries and other navigation aids.
These web documents will have attached styles which can (if desired)
be switched by the user. Users will be able to add links and
bookmarks to read-only documents.
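As a rough illustration of why structured markup makes this kind of automatic generation easy, here is a small Python sketch that builds a table of contents from nested sections. It assumes a made-up document structure (elements named section, each with a title child); a real DTD would use its own names, and the function name build_toc is likewise just for this example.

```python
import xml.etree.ElementTree as ET


def build_toc(xml_text: str) -> list:
    """Return (depth, title) pairs for every <section> in document order."""
    root = ET.fromstring(xml_text)
    entries = []

    def walk(element, depth):
        for child in element:
            if child.tag == "section":  # hypothetical element name
                title = child.findtext("title", default="(untitled)")
                entries.append((depth, title.strip()))
                walk(child, depth + 1)  # nested sections sit one level deeper

    walk(root, 0)
    return entries


sample = (
    "<manual>"
    "<section><title>Install</title>"
    "<section><title>Unix</title></section>"
    "</section>"
    "<section><title>Usage</title></section>"
    "</manual>"
)

for depth, title in build_toc(sample):
    print("  " * depth + title)
```

The same walk over the element tree could collect figure titles or index terms instead, which is the sort of simple post-processing that plain HTML does not give us.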
Ultimately, we will be able to do everything we are able to do
now with documents in SGML, and more: one-to-many links,
many-to-many links, links to read-only documents, nameable
navigable links (virtual documents), and availability on any
platform that supports a web browser.
Sun are putting all their Answerbook documentation into this and they
claim to be working on a pretty clever fall-back addressing
mechanism. Basically, a link to another document will look for that
document locally. If it can't find it, it will expand the search
gradually until it does find it. If it can't find it anywhere, it
will finally fall back to a copy of the manual on Sun's own servers
... and all this will happen transparently to the user.
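I have not seen how Sun actually implements this, so the following is only my own guess at the general shape of such a fall-back lookup, written in Python. The function name resolve_document, the list of search locations and the fall-back address are all assumptions made up for the sketch.

```python
from pathlib import Path
from typing import Iterable


def resolve_document(name: str,
                     search_roots: Iterable[Path],
                     fallback_base: str) -> str:
    """Return the first local copy of `name`, else a URL on the fall-back server.

    `search_roots` is ordered from nearest to farthest, so the search
    widens gradually (e.g. local disk, then a departmental server mount).
    """
    for root in search_roots:
        candidate = Path(root) / name
        if candidate.is_file():
            return str(candidate)
    # Nothing found anywhere locally: point at the publisher's own copy.
    return fallback_base.rstrip("/") + "/" + name
```

A caller would pass something like resolve_document("guide.xml", [Path("/usr/local/docs"), Path("/net/docs")], "http://docs.example.com/manuals"), where both directories and the URL are placeholders. The reader never sees which copy was used, which matches the "transparent to the user" behaviour described above.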
Suddenly (no, not really, XML has been in development for a fair
while), we can seriously imagine single-sourcing all of our
I'll get off my soapbox now, and get to work ....