TechWhirl (TECHWR-L) is a resource for technical writing and technical communications professionals of all experience levels and in all industries to share their experiences and acquire information.
For two decades, technical communicators have turned to TechWhirl to ask and answer questions about the always-changing world of technical communications, such as tools, skills, career paths, methodologies, and emerging industries. The TechWhirl Archives and magazine, created for, by and about technical writers, offer a wealth of knowledge to everyone with an interest in any aspect of technical communications.
I sympathize with your problem. If you have the power to, I suggest you
require submitters to use *one* format rather than try to *figure out*
what they're using. Still, I accept you may not be able to do this.
If you and your developers are working on Windows platforms, you will
probably run into two character encodings:
ANSI is the 8-bit (256-character) set on which Windows was originally
based (in Western locales it is usually code page Windows-1252). It has
the advantage of covering most of the scripts that use Roman letters.
Unfortunately, it can't cover Middle Eastern, East Asian, or other
scripts.
The most common "worldwide" character encoding in modern use is
Unicode. Strictly speaking, Unicode is a character set rather than a
16-bit encoding (its code points now extend beyond 16 bits), and when
someone says they are using "Unicode", they often mean they're using
UTF-8. UTF-8 is a transform that lets you *transmit* Unicode characters
as a sequence of 8-bit units. This is usually safer, since nearly all
the hardware and OS/networking software in the world handles 8-bit
units correctly. Every programming language I know of that supports
Unicode provides quick conversion between UTF-8 and its internal
Unicode representation.
As a note, Java uses Unicode (UTF-16) internally for its strings. When
you write to output, it converts to the platform's default encoding
unless you specify another, such as UTF-8. Swing likewise handles any
needed conversion for you. Java is intelligent enough to know what to
convert, and when. Conversion from any known encoding to Unicode is
simple to do and available in a variety of programming languages.
Though I don't know Perl well, I'm sure it supports this either
out of the box (the core Encode module) or with a readily obtainable
module.
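To make the round trip concrete, here's a minimal Java sketch (the class name and sample string are just illustrative):

```java
import java.nio.charset.StandardCharsets;

public class EncodingDemo {
    public static void main(String[] args) {
        // A string with one non-ASCII character; Java holds it as UTF-16 internally.
        String text = "caf\u00e9";
        // Encode to UTF-8 bytes for transmission or storage.
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        // The 'e' with acute accent takes two bytes in UTF-8: 3 ASCII bytes + 2 = 5.
        System.out.println(utf8.length); // 5
        // Decode back; the round trip is lossless.
        String back = new String(utf8, StandardCharsets.UTF_8);
        System.out.println(back.equals(text)); // true
    }
}
```

Note that specifying the charset explicitly, rather than relying on the platform default, is what keeps the conversion predictable across machines.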
I don't think any language provides an API for detecting the encoding
of an *arbitrary* text file. Most ANSI characters map cleanly to
Unicode; as you've discovered, alas, not *all* of them do. Fortunately,
you can unambiguously define a constant as a particular Unicode value,
so you can write some rule-of-thumb algorithms that will give you a
good idea of whether a file is ANSI or UTF-8.
As an example, choose a character or characters that are encoded
differently in ANSI and in UTF-8, and that your files are likely to
contain; smart quotes come to mind. Code the character or characters as
Unicode constants, encode them as UTF-8 byte sequences, and do a quick
scan through the input file for a match. If the UTF-8 sequence appears,
the file is almost certainly UTF-8; if only the single ANSI byte
appears, it's probably ANSI. I would guess that some programmer has
already solved this for Perl, too.
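A closely related rule of thumb, sketched here in Java (my sketch, not tested against your submitters' files), is simply to ask whether the bytes decode as strictly valid UTF-8. The ANSI smart-quote bytes are malformed UTF-8 on their own, so they settle the question:

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class Utf8Sniffer {
    // Returns true if the bytes decode as strictly valid UTF-8.
    static boolean looksLikeUtf8(byte[] data) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            // e.g. a lone 0x93 (Windows-1252 left smart quote) is not valid UTF-8
            return false;
        }
    }

    public static void main(String[] args) {
        // A left smart quote as a single ANSI byte...
        byte[] ansiQuote = { 'a', (byte) 0x93, 'b' };
        // ...and the same character as its three-byte UTF-8 sequence.
        byte[] utf8Quote = { 'a', (byte) 0xE2, (byte) 0x80, (byte) 0x9C, 'b' };
        System.out.println(looksLikeUtf8(ansiQuote)); // false
        System.out.println(looksLikeUtf8(utf8Quote)); // true
    }
}
```

A file that is pure 7-bit ASCII passes this test too, but that's harmless: ASCII is valid UTF-8, so treating it as UTF-8 loses nothing.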
As you can see, it's been too long since I was intimately involved in
this. Still, I am convinced that your developers have easy access to
editors that support UTF-8. If you're receiving ANSI files, it's because
your submitters are being a bit lazy. You could certainly figure out who
sends you a particular format, and then convert from that to Unicode
when you receive one of his or her files.
jmalin -at- tuvox -dot- com
The views expressed in this document are those of the sender, and do not
necessarily reflect those of TuVox, Inc.