Subject: Re: PDF to XML conversion
From: Michael Smith <smith -at- io -dot- com>
To: techwr-l -at- lists -dot- raycomm -dot- com
Date: Sat, 8 Jul 2000 23:30:33 -0500
On Friday, July 07, 2000, Karen Field wrote:
> Anyone know anything about converting PDF docs to XML? Is this
> even possible? Is it possible to convert Word docs to XML?
Both are possible, but depending on the nature and number of
documents you need to convert, you may find that it's something
for which you'll want to get some consulting/outsourcing help.
One company I know of that has made this a specific focus of their
XML services offerings is Texterity. You may want to take a look
at their TextCafe site <http://www.textcafe.com>.
They have a simple form you can use to submit/upload a file you'd
like to convert to XML. Once they've taken a look at the file,
they'll give you a free quote on conversion costs. (Although the
form only mentions converting PDFs, I'm sure they can also give
you a quote on any Word document you upload.) I think any other
XML consulting organization that offers conversion services ought
to be able to do the same thing for you.
The main issue with converting a document from PDF, Word, or
anything else to XML is that you're going from a format without
much explicit structure to a format which is purely structural.
So the challenge is to expose whatever structure the document has
in its current form so that you can automate the conversion as
much as possible.
For example, if a Word document you want to convert is already
formatted with logical paragraph and character styles, it's going
to simplify automation of the conversion quite a bit. If, on the
other hand, the differences between headings, etc. in the document
are only implied or apparent through differences in character
styles, indentation, and so on, then automating the conversion is
going to be much more difficult or even impossible.
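To illustrate why logical styles make the conversion easy to automate,
here is a minimal Python sketch. It assumes the paragraphs have already
been extracted from the Word document as (style name, text) pairs; the
style names, the element names, and the paragraphs_to_xml function are
all hypothetical, chosen only for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping from Word paragraph style names to XML element names.
STYLE_TO_ELEMENT = {
    "Heading 1": "title",
    "Body Text": "para",
    "List Bullet": "item",
}

def paragraphs_to_xml(paragraphs):
    """Convert (style_name, text) pairs into a flat XML document string."""
    root = ET.Element("doc")
    for style, text in paragraphs:
        tag = STYLE_TO_ELEMENT.get(style)
        if tag is None:
            # Unmapped style: fall back to a generic element, but keep the
            # style name so nothing from the source is silently lost.
            element = ET.SubElement(root, "para", {"style": style})
        else:
            element = ET.SubElement(root, tag)
        element.text = text
    return ET.tostring(root, encoding="unicode")
```

When every paragraph carries a meaningful style name, the whole
conversion reduces to a table lookup like this one. When everything is
styled "Normal", the lookup has nothing to work with.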
You also need to keep in mind that without adding a human step to
the pre- or post-processing (that is, without doing manual tagging
of some kind), you're just going to end up with XML that's only as
structured as the Word or PDF documents you start with. So if your
source documents lack a discernible, usable structure, then your
XML will also.
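As a rough sketch of what the human step might look like in practice,
the Python below guesses at headings from direct formatting alone and
marks every guess for manual review. The input format (dictionaries
with "text", "bold", and "size" keys), the thresholds, and the
guess_structure function are assumptions made up for this example, not
part of any real conversion tool.

```python
import xml.etree.ElementTree as ET

def guess_structure(paragraphs, body_size=11.0, max_heading_chars=80):
    """Build an XML tree, guessing headings from formatting cues only."""
    root = ET.Element("doc")
    for para in paragraphs:
        # Heuristic: short, bold text set larger than body text is
        # probably a heading. This is a guess, not a fact about the source.
        looks_like_heading = (
            para["bold"]
            and para["size"] > body_size
            and len(para["text"]) < max_heading_chars
        )
        if looks_like_heading:
            # Flag the guess so a person can confirm or correct it later.
            element = ET.SubElement(root, "title", {"review": "yes"})
        else:
            element = ET.SubElement(root, "para")
        element.text = para["text"]
    return root
```

A heuristic like this can recover some structure, but each flagged
element still needs a person to check it, which is the manual tagging
cost described above.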
-- Mike Smith
Michael Smith ... xml-doc-owner -at- egroups -dot- com
XML-DOC mailing list ... http://www.egroups.com/group/xml-doc/
Subscribe: ... xml-doc-subscribe -at- egroups -dot- com
Subscribe to digest: ... xml-doc-digest -at- egroups -dot- com