RE: Text to xml: Encoding issues

Subject: RE: Text to xml: Encoding issues
From: "Joe Malin" <jmalin -at- tuvox -dot- com>
To: "Inbar, Paul" <paul -dot- inbar -at- intel -dot- com>, <techwr-l -at- lists -dot- techwr-l -dot- com>
Date: Tue, 27 Dec 2005 14:03:38 -0800

Dear Paul,

I sympathize with your problem. If you have the power to, I suggest you
require submitters to use *one* format rather than try to *figure out*
what they're using. Still, I accept you may not be able to do this.

If you and your developer are working on Windows platforms, you will
probably run into two character encodings: ANSI and Unicode.

ANSI is the 8-bit (256-character) set on which Windows was originally
based; in practice the name refers to a Windows code page such as
Windows-1252. It has the advantage of covering most of the "alphabets"
(scripts) that use Roman letters (glyphs). Unfortunately, a single code
page can't also cover Middle Eastern, East Asian, or other scripts.
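
To make that limit concrete, here is a minimal Python sketch. It assumes
"ANSI" means the Windows-1252 code page (Python calls it "cp1252"); the
function name fits_in_ansi is my own, purely for illustration.

```python
def fits_in_ansi(text: str, codepage: str = "cp1252") -> bool:
    """Return True if every character in text exists in the code page."""
    try:
        text.encode(codepage)
        return True
    except UnicodeEncodeError:
        # At least one character has no slot in this 8-bit code page.
        return False


# "caf\u00e9" is the French word cafe with an acute accent on the e.
print(fits_in_ansi("caf\u00e9"))
# "\u05e9\u05dc\u05d5\u05dd" is the Hebrew word shalom.
print(fits_in_ansi("\u05e9\u05dc\u05d5\u05dd"))
```

The first call should report that the text fits and the second that it
does not, because Windows-1252 has no Hebrew letters.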

The most widely used "worldwide" character set today is Unicode.
Unicode itself is not an 8-bit or 16-bit encoding; it assigns a number
(a code point) to each character, and Windows and Java happen to store
those numbers in 16-bit units (UTF-16). When someone says they are
using "Unicode", they often mean they're using UTF-8. UTF-8 is a
transform that allows you to store and *transmit* Unicode characters as
a sequence of 8-bit bytes, one to four per character. This is usually
safer, since nearly all the hardware and OS/networking software in the
world deals correctly with 8-bit units. All the programming languages I
know of that support Unicode provide quick conversion between UTF-8 and
their internal Unicode representation.
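
As a rough illustration of that conversion, here is a short Python
sketch. The helper names to_utf8 and from_utf8 are my own; the encode
and decode calls are standard Python.

```python
def to_utf8(text: str) -> bytes:
    """Turn a Unicode string into its UTF-8 byte sequence."""
    return text.encode("utf-8")


def from_utf8(data: bytes) -> str:
    """Turn UTF-8 bytes back into a Unicode string."""
    return data.decode("utf-8")


# A word wrapped in curly ("smart") double quotes, U+201C and U+201D.
text = "\u201cquoted\u201d"
utf8_bytes = to_utf8(text)

# Each curly quote takes three bytes in UTF-8; each ASCII letter takes one.
print(len(text), len(utf8_bytes))
print(from_utf8(utf8_bytes) == text)
```

The string has 8 characters but 12 bytes in UTF-8, and decoding the
bytes gives back the original string unchanged.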

As a note, Java uses Unicode (UTF-16) internally. When you write to
output, it converts to the encoding you name or, if you name none, to a
default that depends on the platform and the Java version; on Windows
that default may well be ANSI rather than UTF-8, so it is safer to name
the encoding explicitly. Swing, as far as I know, works with Unicode
strings directly. Conversion from any known format to Unicode is simple
to do and available from a variety of programming languages. Though I
don't know Perl, I'm sure that Perl supports it either "out-of-the-box"
or with a readily obtainable module.
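
For what it's worth, this is roughly what such a conversion looks like
in Python once you know the source encoding. The function name
convert_to_utf8 is hypothetical, and "cp1252" is my assumption for what
"ANSI" means on a Western European Windows machine.

```python
def convert_to_utf8(data: bytes, source_encoding: str) -> bytes:
    """Re-encode bytes from a known source encoding as UTF-8."""
    # Step 1: bytes in the known encoding -> Unicode string.
    text = data.decode(source_encoding)
    # Step 2: Unicode string -> UTF-8 bytes.
    return text.encode("utf-8")


# b"\x93hi\x94" is the word hi in curly quotes as Windows-1252 stores it.
ansi_bytes = b"\x93hi\x94"
print(convert_to_utf8(ansi_bytes, "cp1252"))
```

The two single-byte ANSI quotes come out as the three-byte UTF-8
sequences for U+201C and U+201D.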

I don't think that any language provides a reliable API for detecting
the encoding of an *arbitrary* text file; the best you can get is a
heuristic. Most of ANSI (Windows-1252) has the same numeric values as
the corresponding Unicode characters; you've discovered that, alas, not
*all* of it does (smart quotes and the other characters in the
0x80-0x9F range are the usual culprits). Fortunately, you can
unambiguously define a constant to be a particular Unicode value, so
you can do some "rule-of-thumb" algorithms that would give you a good
idea if a file is ANSI or UTF-8.
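
One common rule of thumb, which is not the scan I describe in this
message but a related check, is simply to try decoding the bytes
strictly as UTF-8 and fall back to ANSI if that fails. A minimal Python
sketch, assuming the only two candidates are UTF-8 and Windows-1252
(the function name guess_encoding is hypothetical):

```python
def guess_encoding(data: bytes) -> str:
    """Guess between UTF-8 and Windows-1252 for a chunk of file bytes."""
    try:
        # Strict decoding raises on any byte sequence that is not valid UTF-8.
        data.decode("utf-8")
    except UnicodeDecodeError:
        # Assumption: anything that is not valid UTF-8 is treated as ANSI.
        return "cp1252"
    # Valid UTF-8 (this includes plain ASCII, which is the same in both).
    return "utf-8"


print(guess_encoding(b"\x93hi\x94"))                       # ANSI smart quotes
print(guess_encoding("\u201chi\u201d".encode("utf-8")))   # UTF-8 smart quotes
```

This works because ANSI text with accented letters or smart quotes very
rarely happens to form valid UTF-8 sequences, but it is still a guess,
not a proof.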

As an example, choose a character or characters (like smart quotes) that
are different between ANSI and Unicode, such that a Unicode file is
likely to include them, but an ANSI file is not likely to have the
equivalent ANSI character (again, smart quotes come to mind).

Code the character or characters as Unicode constants and then do a
quick scan through the input file for a match. If you don't get a match,
the file is probably not Unicode (or it contains only plain ASCII, in
which case the two encodings are identical anyway). I would guess that
some programmer has already solved this for Perl, too.
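
The scan described above might be sketched like this in Python. The
marker list (the four curly quote characters) and the name
looks_like_utf8 are my own choices for illustration; a real list would
depend on what your submitters actually type.

```python
# UTF-8 byte sequences for the curly quotes U+2018, U+2019, U+201C, U+201D.
UTF8_MARKERS = [ch.encode("utf-8") for ch in "\u2018\u2019\u201c\u201d"]


def looks_like_utf8(data: bytes) -> bool:
    """Return True if any UTF-8 encoded smart quote appears in the bytes."""
    return any(marker in data for marker in UTF8_MARKERS)


utf8_sample = "\u201chi\u201d".encode("utf-8")   # smart quotes in UTF-8
ansi_sample = b"\x93hi\x94"                      # same text in Windows-1252

print(looks_like_utf8(utf8_sample))
print(looks_like_utf8(ansi_sample))
```

A file with no smart quotes at all gives no match either way, so treat
a miss as "probably not UTF-8", not as a certainty.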

As you can see, it's been too long since I was intimately involved in
this. Still, I am convinced that your developers have easy access to
editors that support UTF-8. If you're receiving ANSI files, it's because
your submitters are being a bit lazy. You could certainly figure out who
sends you a particular format, and then convert from that to Unicode
when you receive one of his or her files.
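
A per-submitter lookup of that kind could be as simple as the following
Python sketch. The addresses, the table contents, and the name
decode_submission are all made up for illustration.

```python
# Hypothetical table: which encoding each known submitter sends.
SUBMITTER_ENCODINGS = {
    "alice@example.com": "cp1252",   # sends ANSI (Windows-1252) files
    "bob@example.com": "utf-8",
}


def decode_submission(sender: str, data: bytes, default: str = "utf-8") -> str:
    """Decode a submitted file using the encoding recorded for its sender."""
    encoding = SUBMITTER_ENCODINGS.get(sender, default)
    return data.decode(encoding)


text = decode_submission("alice@example.com", b"\x93hi\x94")
print(text == "\u201chi\u201d")
```

Unknown senders fall back to the default here; you might prefer to flag
them for a manual check instead.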

Joe Malin
Technical Writer
jmalin -at- tuvox -dot- com
The views expressed in this document are those of the sender, and do not
necessarily reflect those of TuVox, Inc.

Is there a way of detecting what encoding your "text" files are in? Is
there a way of automatically converting them to a particular chosen

