Subject: RE: Convert Word files to XML?
From: "Janoff, Steven" <Steven -dot- Janoff -at- ga -dot- com>
To: "techwr-l -at- lists -dot- techwr-l -dot- com TECHWR-L" <techwr-l -at- lists -dot- techwr-l -dot- com>, Richard Hamilton <dick -at- rlhamilton -dot- net>
Date: Fri, 25 Jul 2014 11:47:01 -0700
Big thanks to Robert, Tony, and Richard for the responses so far -- thank you!
Yes, sorry, I should have been more specific.
For some applications I'll be converting the Word files to S1000D/XML for use in Arbortext.
For others, I'll be converting the Word files to DITA/XML for use in both Arbortext and Oxygen.
There isn't a large volume of Word files up front, but the project could grow. (The first few docs are 50-100 pages each.)
Yes, I've seen the Eliot Kimber sources so that looks good and I will check those out.
Thank you again, and I'll follow up with more details as I understand the project better. In the meantime, I'll look into what's been suggested so far.
Thanks, Richard, for the in-depth discussion. (I do have some XSLT/XSL-FO experience but I'll need to refresh.)
PS - Also wondering what you do when you get a plain text file and you want to convert it to XML/DITA, for example. Do you primarily start with templates? Thanks for any direction there too. There might be a possibility to receive some of those.
From: On Behalf Of Richard Hamilton
Sent: Friday, July 25, 2014 11:21 AM
To: techwr-l -at- lists -dot- techwr-l -dot- com TECHWR-L
Subject: Convert Word files to XML?
There are several factors. The most important are: what XML schema you are converting to, how clean your Word content is, and how much content you need to convert.
Bottom line for me is that if you have a lot of content to convert, you should seriously consider contracting the job out to a conversion company, unless you have some serious expertise with XSL and related tools.
Here is some detail on some tools to consider if you want to go it alone:
I convert Word to DocBook XML using OpenOffice, which will export DocBook directly. However, sometimes it's better to export HTML and then use a utility called Herold to convert to DocBook. I've also used the rather circuitous route of uploading Word to a Confluence wiki, then exporting DocBook using a plug-in exporter developed by a company called k15t software. Which one I use in a given case depends on what the input looks like.
You can convert Word to DITA using DITA for Publishers (dita4publishers.sourceforge.net). I haven't used it myself, but I know the developer (Eliot Kimber), and he does quality work, so I'd definitely give it a try if you're headed towards DITA.
One caveat is that I've found it exceedingly rare that a conversion will be completely clean. You need to plan on doing some kind of cleanup on the output of any of these tools, using an XSL stylesheet, Perl, manual editing, or a combination of all three, unless your input is really simple and well suited to the tool you use (which, with Word, I've never seen :-).
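As a rough illustration of what the scripted part of such a cleanup might look like, here is a minimal sketch in Python (standing in for the Perl or XSL mentioned above). It assumes DocBook-style output where paragraphs are `para` elements; the function name `clean_converted_xml` and the two cleanup rules (collapse whitespace, drop empty paragraphs) are hypothetical choices for the example, not something any of the tools above prescribe.

```python
import re
import xml.etree.ElementTree as ET


def clean_converted_xml(xml_text, para_tag="para"):
    """Collapse stray whitespace and drop empty paragraphs in converted XML.

    para_tag is an assumption: DocBook uses <para>, DITA uses <p>.
    """
    root = ET.fromstring(xml_text)

    # Collapse runs of whitespace at the start of each paragraph's text.
    for elem in root.iter(para_tag):
        if elem.text:
            text = re.sub(r"\s+", " ", elem.text)
            # Only trim the trailing space when no inline child follows,
            # so "See <emphasis>this</emphasis>" keeps its space.
            elem.text = text.strip() if len(elem) == 0 else text.lstrip()

    # Remove paragraphs that have no text and no child elements.
    # Snapshot the tree first so removal does not disturb the traversal.
    # Note: any text trailing a removed paragraph is discarded with it;
    # this sketch assumes that trailing text is whitespace only.
    for parent in list(root.iter()):
        for child in list(parent):
            is_empty = not (child.text or "").strip() and len(child) == 0
            if child.tag == para_tag and is_empty:
                parent.remove(child)

    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    sample = "<section><para>  Hello   world \n</para><para>   </para><para/></section>"
    print(clean_converted_xml(sample))
```

A real cleanup would almost certainly need more rules than this (stray formatting elements, broken lists, and so on), which is why manual editing usually stays part of the mix.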
XML for Technical Communicators http://xmlpress.net
hamilton -at- xmlpress -dot- net
On Jul 25, 2014, at 10:46 AM, Janoff, Steven wrote:
> For those with experience converting Word files to XML:
> What's the easiest or most effective way you've found to do this?
> Does it depend on the XML editor you're importing into?
> Arbortext is currently editor of choice, but I might also have the opportunity to install Oxygen at home.
> Thanks for your advice. I'll be researching on the web also, but that looks like a bit of a mish-mash.