Text to xml: Encoding issues

Subject: Text to xml: Encoding issues
From: "Inbar, Paul" <paul -dot- inbar -at- intel -dot- com>
To: <techwr-l -at- lists -dot- techwr-l -dot- com>
Date: Tue, 27 Dec 2005 21:36:39 +0200

Hi all,

Can anyone point me to information about character encoding? Here is my situation.

I receive "text" files from software developers. From these files I create (via a Perl script) an XML file that I either transform to HTML or load into Structured Frame. I thought all I had to do was convert the characters that would otherwise confuse the XML (<, ', etc.). But the issue seems to be more complex than that, and I don't even know exactly how to phrase the problem.

I think it stems from the fact that the developers unknowingly submit different kinds of "text": some Unicode, some UTF-8, some ASCII. (Please excuse me if I am being imprecise here; I am fuzzy on these issues.) For instance, developers commonly copy and paste from Word and include smart quotes. In a text editor I can search and replace these with straight quotes, but I can't figure out how to get Perl to do this automatically. If I just leave the smart quotes in, Structured Frame manages to interpret the characters correctly, but in the HTML document I produce via XSLT the "smart" quotes show up as boxes.

I know that there is something about encoding that you can handle in the XML and HTML files themselves, but I am not sure how to fit this all together or how to deal with the problem.

Does anyone have any experience in handling these kinds of issues? Is there a way of detecting what encoding your "text" files are in? Is there a way of automatically converting them to a particular chosen format?
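
As a rough illustration of the detect-then-convert step asked about above, here is a minimal sketch in Python (not Perl, and not part of the workflow described in this post). It assumes the incoming files are UTF-8, UTF-16 with a byte order mark, plain ASCII, or Windows-1252 (the usual result of pasting from Word on Windows); that candidate list is an assumption, and the function names `decode_text` and `straighten_quotes` are hypothetical. Encoding detection of this kind is a heuristic, not a guarantee.

```python
import codecs

# Smart quotes that Word commonly inserts, mapped to straight quotes.
SMART_QUOTES = {
    "\u2018": "'",   # left single quotation mark
    "\u2019": "'",   # right single quotation mark
    "\u201c": '"',   # left double quotation mark
    "\u201d": '"',   # right double quotation mark
}


def decode_text(raw):
    """Guess the encoding of raw bytes; return (text, encoding_name).

    Heuristic only: check for a byte order mark, then try strict UTF-8
    (which also covers plain ASCII), then fall back to Windows-1252.
    """
    if raw.startswith(codecs.BOM_UTF8):
        return raw.decode("utf-8-sig"), "utf-8-sig"
    if raw.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return raw.decode("utf-16"), "utf-16"
    try:
        return raw.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        # Assumed fallback; bytes undefined in cp1252 become U+FFFD.
        return raw.decode("cp1252", errors="replace"), "cp1252"


def straighten_quotes(text):
    """Replace smart quotes with their straight equivalents."""
    return text.translate(str.maketrans(SMART_QUOTES))


if __name__ == "__main__":
    sample = b"\x93quoted\x94 text"   # cp1252 bytes for smart double quotes
    text, encoding = decode_text(sample)
    print(encoding, straighten_quotes(text))
```

Once everything is decoded to one internal form, the script can write the XML in a single known encoding (UTF-8, say) and state that same encoding in the XML declaration and in the generated HTML. A mismatch between the bytes actually written and the encoding declared is one plausible cause of smart quotes rendering as boxes, though the post does not give enough detail to confirm it.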



