The mystery of UTF-8

"UTF" stands for Unicode Transformation Format.  The "8" says the text is expressed in 8 bit units (bytes); a single character may take one or more of them.  UTF-16, on the other hand, expresses text in 16 bit units.

OK.  That information isn't very helpful for captioning.  What's important is to realize that UNICODE characters can be expressed in more than one way.  So just knowing that your DVD authoring system wants UNICODE text is only part of the story.  You also need to know if the UNICODE needs to be expressed in UTF-8 or UTF-16.
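To make the difference concrete, here is a minimal Python sketch that expresses one UNICODE character in three ways.  Python is our choice for illustration only; nothing here is specific to AutoCaption or to any DVD authoring system, and the variable names are made up for the example.

```python
# One character, U+00E9 (e with acute accent), expressed three ways.
text = "\u00e9"

utf8_bytes = text.encode("utf-8")        # variable width, 8 bit units
utf16_bytes = text.encode("utf-16-be")   # 16 bit units, big-endian, no byte order mark
latin1_bytes = text.encode("latin-1")    # ISO-8859-1, one byte per character

for label, raw in (("UTF-8", utf8_bytes),
                   ("UTF-16", utf16_bytes),
                   ("Latin 1", latin1_bytes)):
    print(f"{label:8} {raw.hex()}")
```

The character number is the same in all three cases; only the bytes written to the file differ.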

One particularly smart thing the UNICODE designers did was to make the first 128 characters of UNICODE the same as ANSI ASCII, so pure ASCII documents are also UNICODE.  The 'Latin 1' character set (ISO-8859-1) is a variation of the original IBM PC's 8 bit character set.  When graphics cards became popular, it was no longer necessary to reserve characters for drawing lines and boxes around text.  So this variant replaces some of the drawing characters with characters unique to a greater variety of languages.  Later, these extra 96 characters retained their character numbers in UNICODE.  In short, ASCII is a subset of ISO-8859-1 which, in turn, is a subset of UNICODE.

The focus of the UNICODE designers was on assigning a unique number to every printable character in every language; they were less interested in telling people how that number must be expressed.  Although UNICODE provides a number of ways to express characters above hexadecimal 7F (127 in decimal), none is quite the same as ISO-8859-1.
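The subset relationship, and the point where the byte expressions part ways, can be checked with a short Python sketch.  The variable names are ours, chosen for illustration.

```python
# Every ISO-8859-1 byte value maps to the UNICODE character with the same number.
latin1_bytes = bytes(range(256))
decoded = latin1_bytes.decode("latin-1")
same_numbers = all(ord(ch) == b for ch, b in zip(decoded, latin1_bytes))

# Pure ASCII bytes are unchanged when expressed as UTF-8.
ascii_bytes = bytes(range(128))
ascii_unchanged = ascii_bytes.decode("ascii").encode("utf-8") == ascii_bytes

# Above hexadecimal 7F the UTF-8 bytes never match the single ISO-8859-1 byte.
above_7f_differ = all(chr(n).encode("utf-8") != bytes([n]) for n in range(128, 256))

print(same_numbers, ascii_unchanged, above_7f_differ)
```

All three checks come out true: the character numbers agree, but the bytes on disk agree only for the ASCII range.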

Further reference material on how characters are expressed can be found in Roman Czyborra's excellent The ISO 8859 Alphabet Soup (go to the author's site or an archived copy at Linköpings University in Sweden).

What is UTF-8?

8 bits can express 256 different values (0 through 255).  But UNICODE has a lot more than 256 characters.  So more than one system for expressing the "extra" UNICODE characters evolved.

UTF-8 is one such system.  It uses a variable number of bytes (from 1 to 4 according to RFC3629) to express each character.  It is a good solution for 8 bit microprocessors used in industrial equipment for example.
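A quick way to see the variable width is to count the bytes Python produces for a few sample characters.  The sample set below is our own choice for illustration.

```python
# Sample characters written as escapes so the example survives any file encoding.
samples = {
    "A": "A",                  # U+0041
    "e-acute": "\u00e9",       # U+00E9
    "euro sign": "\u20ac",     # U+20AC
    "emoji": "\U0001F600",     # U+1F600
}

# Number of bytes each character needs when expressed in UTF-8.
byte_counts = {name: len(ch.encode("utf-8")) for name, ch in samples.items()}
print(byte_counts)
```

The counts run from 1 byte for the plain letter up to 4 bytes for the emoji, matching the range given in RFC3629.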

But many applications evolved using only 7 bits.  People didn't like to wait for things like modems, so clever engineers figured that they could speed up communications by not transmitting one of the 8 bits in each character.  After all, they reasoned, the basic alphabet for each western language fits in 128 characters.  And surely the sender and receiver will speak the same language.

The ANSI ASCII standard, and expressing characters in 7 bits, evolved from this need for efficient standardized communications.  Everything would work fine in Western languages as long as the sender and receiver agreed on the language for the transmission.

By the time UNICODE got off the ground, using 7 bits to express each character was common.  By then it was also pretty unnecessary to conserve one crummy bit.  Hard drives were bigger, RAM was cheap and 300 baud modems were used only for door stops.  So the UTF-8 folks went back to expressing characters with 8 bits and cleverly put the 8th bit to work.

In characters expressed according to UTF-8, the most significant bit (MSB) of each byte will be 0 for single byte characters.  If the MSB is 1, it signifies this byte is part of a multibyte character.  The idea is to signal that more than one byte will be used to express a single character.
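The high-bit test can be sketched in a few lines of Python.  The helper name has_high_bit is hypothetical, introduced only for this example.

```python
def has_high_bit(byte_value: int) -> bool:
    """Return True when the most significant bit of an 8 bit value is 1."""
    return (byte_value & 0x80) != 0

# "cafe" with an accented final letter: three single byte characters
# followed by one two byte character.
data = "caf\u00e9".encode("utf-8")
flags = [has_high_bit(b) for b in data]
print(flags)
```

The first three bytes have the MSB clear; the last two, which together express the accented letter, both have it set.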

How can a captioner tell if they have UTF-16, UTF-8, or ANSI ASCII?

At a glance, text in this UTF-8 scheme is hard to distinguish from other 8-bit expressions such as the ANSI ASCII variations (e.g. Latin1), in which all characters are 8 bits and all characters beyond 127 have the high bit set.

So somehow the captioner has to know if characters in a document are expressed according to UTF-8 or ANSI ASCII or some other scheme.  Unfortunately, there is no sure fire way to tell, but here's what to look for:
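As one rough, automated take on that check, the Python sketch below looks for a byte order mark, then for null bytes, then tries to read the data as UTF-8.  It is a heuristic under our own assumptions, not a guarantee: the function name guess_encoding and the labels it returns are made up for this example, and a short Latin1 file can occasionally pass as valid UTF-8.

```python
import codecs

def guess_encoding(data: bytes) -> str:
    """Make an educated guess about how the bytes of a text file are expressed."""
    # A byte order mark at the start of the file is the strongest hint.
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8 (BOM)"
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16 (BOM)"
    # Null bytes are rare in 8 bit text but common in UTF-16 Western text.
    if b"\x00" in data:
        return "utf-16 (no BOM, likely)"
    # No high bits at all: plain 7 bit ASCII, valid in every scheme discussed here.
    if all(b < 0x80 for b in data):
        return "ascii"
    # High bits present: see whether they follow the UTF-8 multibyte rules.
    try:
        data.decode("utf-8")
        return "utf-8 (probably)"
    except UnicodeDecodeError:
        return "8-bit, not utf-8"

print(guess_encoding("caf\u00e9".encode("utf-8")))
print(guess_encoding("caf\u00e9".encode("latin-1")))
```

To try it on a real asset, read the file in binary mode and pass the bytes to the function.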

I have UTF-16 captioning assets.  My client wants UTF-8.

Use your word processor.

Open the transcript or DVD subtitling asset in an ordinary word processor and use the Save As feature to convert between ANSI ASCII variations, UTF-8, and UTF-16.

Rich Text Format (RTF) UNICODE documents always start by saying how the characters are expressed, so importing them into AutoCaption should be no problem.

Problems occur when a DVD authoring package wants old style UTF-8 multibyte assets and AutoCaption has generated UTF-16.  In that case, simply open the UTF-16 file in your word processor and use Save As to save it in UTF-8.
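If a word processor is not handy, the same conversion can be sketched in Python.  This is an illustration under our own assumptions, not an AutoCaption feature: the function names and the file names in the usage note are hypothetical, and the sketch assumes the UTF-16 file starts with a byte order mark, as files saved by most word processors do.

```python
from pathlib import Path

def utf16_bytes_to_utf8(data: bytes) -> bytes:
    """Re-express UTF-16 bytes as UTF-8 bytes; line endings are left untouched."""
    text = data.decode("utf-16")   # the byte order mark, if present, selects the byte order
    return text.encode("utf-8")    # written without a byte order mark

def convert_file(src: str, dst: str) -> None:
    """Read a UTF-16 file and write its UTF-8 equivalent to a new file."""
    Path(dst).write_bytes(utf16_bytes_to_utf8(Path(src).read_bytes()))
```

A call such as convert_file("subtitles_utf16.txt", "subtitles_utf8.txt") would write the converted copy next to the original, leaving the original untouched.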

Gory details of Multibyte

The number of leading 1 bits in the first byte of a multi-byte sequence is equal to the total number of bytes.  Each of the follow-on bytes will have the first bit set to 1 and the second to zero.  All remaining bits (shown as 'x' below) are used to represent the character number.

1 byte character 0xxxxxxx

2 byte character 110xxxxx 10xxxxxx

3 byte character 1110xxxx 10xxxxxx 10xxxxxx

4 byte character 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
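The four bit patterns above can be turned into a small hand-written encoder.  The Python sketch below exists only to illustrate the layout; the name encode_utf8 is ours, and in practice Python's built-in str.encode("utf-8") does the same job.

```python
def encode_utf8(code_point: int) -> bytes:
    """Pack a UNICODE character number into the UTF-8 bit patterns by hand."""
    if not 0 <= code_point <= 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("not a valid UNICODE character number")
    if code_point < 0x80:        # 0xxxxxxx
        return bytes([code_point])
    if code_point < 0x800:       # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point < 0x10000:     # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (code_point >> 18),
                  0x80 | ((code_point >> 12) & 0x3F),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])

# Show the bit pattern for one character of each length.
for cp in (0x41, 0xE9, 0x20AC, 0x1F600):
    print(f"U+{cp:04X}", " ".join(f"{b:08b}" for b in encode_utf8(cp)))
```

Each printed line shows the leading 1 bits of the first byte counting the total bytes, and every follow-on byte starting with 10.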

UTF-16 encoding is an alternative byte expression of Unicode which for most cases amounts to a fixed-width 16 bit code.  ASCII and Latin1 characters (the first 256 characters) are expressed in 8 bits as normal, but with a preceding null (all bits low) byte.  Although 16 bit expressions are conceptually simpler than UTF-8 they have two major drawbacks, unless, like AutoCaption, the text handling is designed for UTF-16:

The advantages of using UTF-16 far outweigh the drawbacks:

For more information, visit the UTF-8 and Unicode FAQ for Unix/Linux.

 


 

The general outline and logic of this user bulletin must be attributed to the excellent discussion at sourceforge.net; we made only minor changes, hopefully making the material more appropriate for people who caption.
user_bulletin_utf-8.html  80519