Updated 2016-12-09 15:27:02 by chw

Richard Suchenwirth 1999-08-11 - Global players need many languages. And writing systems. For Chinese, Korean, or just Greek, we need a way to encode such non-ASCII characters.

For a historical perspective and beginner's technical introduction, see Joel Spolsky's missive at http://www.joelonsoftware.com/articles/Unicode.html

The encoding standard to cover all these writing systems is Unicode ( http://www.unicode.org/ ), a 16 (or more) bit wide encoding for presently 94,140 distinct coded characters from more than 25 supported scripts (as of Unicode 3.1). Tcl/Tk has supported Unicode since version 8.1, using 16-bit chars, or the UTF-8 encoding, as the internal representation for strings.

UTF-8 is made to cover 7-bit ASCII, Unicode, and its superset ISO 10646 (which offers 31 bits of width, but seems to be overkill for most practical purposes). Characters are represented as sequences of 1..6 eight-bit bytes - termed octets in the character set business - (for ASCII: 1, for Unicode: 2..3) as follows:

  • ASCII 0x00..0x7F (Unicode page 0, left half): 0x00..0x7F. Nothing changed.
  • Unicode, pages 00..07: 2 bytes, 110aaabb 10bbbbbb, where aaa are the rightmost bits of page#, bb.. are the bits of the second Unicode byte. These pages cover European/Extended Latin, Greek, Cyrillic, Armenian, Hebrew, Arabic.
  • Unicode, pages 08..FE: 3 bytes, 1110aaaa 10aaaabb 10bbbbbb. These cover all the rest of Unicode, including Hangul, Kanji, and so on. This means that East Asian texts are 50% longer in UTF-8 than in pure 16-bit Unicode.
  • ISO 10646 codes beyond Unicode: 4..6 bytes. Many of the newer emoji characters fall in this range.
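These byte lengths are easy to verify from Tcl itself; a quick sketch (any Tcl 8.1 or later should do):

```tcl
# The byte length of a character's UTF-8 encoding depends on its code point:
# "A" is plain ASCII, "\u00E9" is on page 00, "\u20AC" is on page 20.
foreach char [list A \u00E9 \u20AC] {
    set bytes [string length [encoding convertto utf-8 $char]]
    puts "[format U+%04X [scan $char %c]] -> $bytes byte(s)"
}
# U+0041 -> 1 byte(s)
# U+00E9 -> 2 byte(s)
# U+20AC -> 3 byte(s)
```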

The full list of encoded characters.

LV Just this week I had a developer ask me how to handle characters in the 4-6 byte range. How does that work in Tcl? Right now, their tcl application is having a problem when encountering the 𝒜 character (which is a script-A), which has the unicode value of 0x1D49C. tdom says that only UTF-8 chars up to 3 bytes in length can be handled. Is this just a tdom limitation, or is it also a Tcl limitation?

(2016-03-23) J'raxis I just ran into the same issue working on a Tcl script for an IRC app, dealing with the increasingly popular emoji characters. Above it says Tcl represents Unicode internally with 16 bits, which means U+FFFF is the highest it can support. For example Tcl will output U+1F4A9 as "ð\x9f\x92©". In the same line of text I was testing with, other UTF-8-encoded input characters in the U+0000..FFFF range came through just fine.

(2016-12-07) wiwo There is a preprocessor directive in generic/tcl.h:

#define TCL_UTF_MAX 3

The comment says that UCS-2 (max. 3 bytes) is safe. When compiling from source, this value is set to 3. Does anybody know what the drawbacks would be of setting this value to 4? APN I believe AndroWish and related builds bump it up to 4, so it should work, though it will probably lose compatibility with standard Tcl extensions unless they are also rebuilt. Not sure what other changes chw had to make in addition to changing that directive.

chw in AndroWish TCL_UTF_MAX is set to 6, which turns out to use UTF-8 sequences of at most 4 bytes, but to represent one Tcl_UniChar as a 32-bit item. Although this doubles memory requirements, it keeps consistency w.r.t. counting codepoints with TCL_UTF_MAX set to 3. In AndroWish some additional changes were required, too, in order to properly support non-Android platforms which have a UTF-16 OS interface (Windows, see undroidwish). One of your options is to search the AndroWish source tree for places similar to #if TCL_UTF_MAX == 4 and #if TCL_UTF_MAX > 4. Setting TCL_UTF_MAX to 4 was deliberately not chosen, since it requires representing codepoints larger than 0xFFFF as surrogate pairs, which makes character counting expensive and which aren't fully handled by the Tcl core.

A general principle of UTF-8

A general principle of UTF-8 is that the first byte either is a single-byte character (if below 0x80), or indicates length of multi-byte code by the number of 1's before the first 0 and is then filled up with data bits. All other bytes start with bits 10 and are then filled up with 6 data bits. See also UTF-8 bit by bit. A sequence of n bytes can hold
 b = 5n + 1  (1 < n < 7)

bits "payload", so the maximum is 31 bits for a 6-byte sequence.
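A quick loop to tabulate the payload for each multi-byte sequence length (plain arithmetic, nothing Tcl-specific):

```tcl
# For n = 2..6 bytes, a sequence carries b = 5n + 1 payload bits,
# so the highest representable code is 2^b - 1.
for {set n 2} {$n <= 6} {incr n} {
    set b [expr {5*$n + 1}]
    puts "n=$n bytes: $b payload bits, highest code 0x[format %X [expr {(1 << $b) - 1}]]"
}
# n=2 bytes: 11 payload bits, highest code 0x7FF
# n=3 bytes: 16 payload bits, highest code 0xFFFF
# n=4 bytes: 21 payload bits, highest code 0x1FFFFF
# n=5 bytes: 26 payload bits, highest code 0x3FFFFFF
# n=6 bytes: 31 payload bits, highest code 0x7FFFFFFF
```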

It follows from this that bytes in UTF-8 encoding fall in distinct ranges:
   00..7F - plain old ASCII
   80..BF - non-initial bytes of multibyte code
   C2..FD - initial bytes of multibyte code (C0, C1 are not legal!)
   FE, FF - never used (so, free for byte-order marks).
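These ranges can be checked mechanically. Here is a hypothetical little classifier (the name utf8ByteClass is made up for this sketch) for a single byte value:

```tcl
proc utf8ByteClass {byte} {
    if {$byte <= 0x7F} {return ascii}        ;# plain old ASCII
    if {$byte <= 0xBF} {return continuation} ;# non-initial byte of multibyte code
    if {$byte <= 0xC1} {return illegal}      ;# C0, C1 never occur in valid UTF-8
    if {$byte <= 0xFD} {return initial}      ;# initial byte of multibyte code
    return unused                            ;# FE, FF
}
puts [utf8ByteClass 0xE2]   ;# initial (e.g. first byte of a 3-byte char)
```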

The distinction between initial and non-initial helps in plausibility checks, or to re-synchronize with missing data. Besides, it's independent of byte order (16-bit Unicode inherits byte order, so has to express that with the magic FEFF. Should you read FFFE, you're to swap). Tcl however shields these UTF-8 details from us: characters are just characters, no matter whether 7 bit, 16 bit, or more. I liked Tcl since 7.4, and I love it since 8.1. -- Richard Suchenwirth

Here's a little helper that reconverts a real Unicode string (e.g. as pasted to a widget) to \u-escaped ASCII:
 proc u2a {s} {
    set res ""
    foreach i [split $s ""] {
        scan $i %c c
        if {$c<128} {append res $i} else {append res \\u[format %04.4X $c]}
    }
    set res
 } ;#RS 

I found your u2a nice and handy, but added a mapping since it gave me 'wrong' numbers for some reason.
 ## CP1252.TXT says these char should be that 
 ::variable myuscanlist 
 ::set myuscanlist [::list                   \
     "\\u0080" "\\u20AC" "\\u0081" ""        \
     "\\u0082" "\\u201A" "\\u0083" "\\u0192" \
     "\\u0084" "\\u201E" "\\u0085" "\\u2026" \
     "\\u0086" "\\u2020" "\\u0087" "\\u2021" \
     "\\u0088" "\\u02C6" "\\u0089" "\\u2030" \
     "\\u008A" "\\u0160" "\\u008B" "\\u2039" \
     "\\u008C" "\\u0152" "\\u008D" ""        \
     "\\u008E" "\\u017D" "\\u008F" ""        \
     "\\u0090" ""        "\\u0091" "\\u2018" \
     "\\u0092" "\\u2019" "\\u0093" "\\u201C" \
     "\\u0094" "\\u201D" "\\u0095" "\\u2022" \
     "\\u0096" "\\u2013" "\\u0097" "\\u2014" \
     "\\u0098" "\\u02DC" "\\u0099" "\\u2122" \
     "\\u009A" "\\u0161" "\\u009B" "\\u203A" \
     "\\u009C" "\\u0153" "\\u009D" ""        \
     "\\u009E" "\\u017E" "\\u009F" "\\u0178" ];

 proc u2a {s} {
    ::variable myuscanlist 
    set res ""
    foreach i [split $s ""] {
        scan $i %c c
        if {$c<128} {
            append res $i
        } else {
            append res [::string map $myuscanlist \\u[format %04.4X $c]] ;# koyama
        }
    }
    set res
 } ;#RS

Note: All is not that well. Unicode text files can so far not be sourced, or autoindexed, by Tcl. As a preliminary fix, see my Unicode file reader -- RS

Note that the Unicode character \u0000 (which corresponds to an ASCII NUL character or \x00 byte in binary data) is represented using the two-byte form in Tcl. This minor variant on the UTF-8 standard is used to make passing strings through C code much easier, and it makes virtually no difference for most applications (and for the rest, it is nothing but a Good Thing!)

DKF - RS: the two bytes are C0 80. See UTF-8 bit by bit for why. The Unicode consortium has "outlawed" the use of non-shortest byte sequences for security reasons, with the practical consequence that byte values C0 and C1 must never occur in Unicode data. So Tcl will have to make sure that such characters (especially the NUL byte Donal mentioned) are converted to legal forms when exported.

See also: A simple Arabic renderer - Keyboard widget, A little Unicode editor for clicking exotic character Unicodes

LV: Can anyone provide a simple Tk example which shows display of the various characters available with this new support? In particular, my users have a need to display Kanji, Greek, and English (at least) on a single page. - RS: see i18n tester (which also became a Tk widget demo in 8.4).
 pack [text .t]
 .t insert end "Japan  \u65E5\u672C\n"
 .t insert end "Athens \u0391\u03B8\u03AE\u03BD\u03B1\u03B9"

Dirt simple. To get Unicodes in a more ergonomic way, use e.g. the Keyboard widget (that page has an example for a Russian typewriter), or a transliteration (see The Lish family, for instance Greeklish to Greek, Heblish to Hebrew), so you can write
 Athens [greeklish Aqh'nai]

for the same result as above - RS

Here's a simple wrapper for "encoding convertfrom"
 proc <- {enco s} {encoding convertfrom $enco $s}

so you can just say "<- utf-8 [binary format c* {208 184 208 183 208 187 208 190 208 182 208 181 208 189 208 184 208 181}]" - RS

DKF: It is more efficient to use the alias mechanism for this:
 interp alias {} <- {} encoding convertfrom
 interp alias {} -> {} encoding convertto
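With the aliases in place, a round trip might look like this (the alias definitions are repeated so the snippet is self-contained):

```tcl
interp alias {} <- {} encoding convertfrom
interp alias {} -> {} encoding convertto

set bytes [-> utf-8 "\u0391\u03B8"]   ;# two Greek letters
puts [string length $bytes]           ;# 4: two bytes per character
puts [<- utf-8 $bytes]                ;# back to the original string
```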

LV: Has anyone done any work reading and writing unicode via Tcl pipes? How about using unicode in Tk send commands to and from both older (pre Tcl 8) and new (Tk 8.3) applications? The reason I ask is that my users are reporting problems in both these areas - data loss in both cases (eighth bit lost in the case of pipes; incorrect encoding in the latter).

KBK: For pipes, it depends on what the target application generates or expects to see. Once you've opened a pipe, you should be able to [fconfigure $pipeID -encoding whatever] to inform Tcl of the encoding to be used on the pipe.

Not sure about [send] - I work mostly on platforms where it doesn't.

Latest news (Feb 7, 2001): From http://www.unicode.org/unicode/standard/versions/beta-ucd31.html : Unicode 3.1 Beta has another 1024 "Mathematical Alphanumerics", and of course roughly 40000 more Kanji (CJK) listed. These pages are all beyond 16 bit (starting at 10300 for "Old Italic").

With UTF-8, we can handle such character codes with no changes at all; users of UTF-16 or a short wchar_t would be excluded. (RS)

Produce pure 1-byte encoding from Tcl: You can turn a UTF-8 encoded string into a strict 1-byte (and not proper UTF-8) string with [encoding convertfrom identity], as ::tcltest::bytestring demonstrates. This can be used if a C extension expects single-byte codes. The safer, but more complex way would be to use Tcl_UtfToExternalDString on the C side...

International Components for Unicode (ICU) http://www.icu-project.org/ is a C/C++ library for handling lots of issues that developers using Unicode encounter. Has anyone written a binding for Tcl for it yet?

Additional reading to consider regarding Unicode:

Mick O'Donnell: When you don't know which encoding to use for a language, below is code which displays a unicode file in all available encodings:
 proc display-all-encodings {} {
    set file [tk_getOpenFile]
    text .t -bg white -font "Times"
    pack .t -fill both -expand t
    foreach encoding [lsort -ascii [encoding names]] {
        set str [open $file r]
        fconfigure $str -encoding $encoding
        .t insert end "\nEncoding: $encoding\n[read $str]"
        close $str
    }
 }

GPS in the Tcl chatroom: This is interesting: "The encoding known today as UTF-8 was invented by Ken Thompson. It was born during the evening hours of 1992-09-02 in a New Jersey diner, where he designed it in the presence of Rob Pike on a placemat (see Rob Pike's UTF-8 history)...." http://ask.km.ru/3p/plan9faq.html (also http://www.cl.cam.ac.uk/~mgk25/ucs/utf-8-history.txt )

Java also has internal support for UTF-8. The Java UTF Socket Communication page provides examples for socket communication between Tcl and Java through the readUTF() and writeUTF() methods.

Does anyone know anything about "gb2321"?

HaO When receiving utf-8 data byte by byte one may convert the data by
 encoding convertfrom utf-8 $Data

If Data is not a complete utf-8 sequence (e.g. the last character's byte sequence is incomplete), the above command does not fail with an error. It just gives an (in this sense) unexpected result.
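To see the effect, one can truncate a sequence by hand. A sketch (the behavior on the truncated part varies: Tcl 8.x silently produces garbage characters, while stricter versions may raise an error, hence the catch):

```tcl
set full [encoding convertto utf-8 "\u20AC"]   ;# the Euro sign: 3 bytes E2 82 AC
set part [string range $full 0 1]              ;# drop the last byte

if {[catch {encoding convertfrom utf-8 $part} res]} {
    puts "error: $res"            ;# strict Tcl versions may complain
} else {
    puts "no error, got: $res"    ;# Tcl 8.x silently returns garbage characters
}

# The complete sequence round-trips cleanly:
puts [encoding convertfrom utf-8 $full]
```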

To separate complete sequences from any remaining bytes, the following procedure UTF8FullCodes may be used. It takes all complete sequences from a buffer and returns them converted. Any incomplete trailing sequence is left in the buffer. Usage example:
 set f [open file_utf8.txt r]
 fconfigure $f -encoding binary
 set InBuffer ""
 while {![eof $f]} {
    append InBuffer [read $f 100]
    # do something with raw bytes like CRC evaluation etc.
    set NewData [UTF8FullCodes InBuffer]
    # do something with NewData
 }
 close $f

Remark: of course, the file might be opened using
 fconfigure $f -encoding utf-8

and Tcl takes care of everything. This issue is only important if the raw bytes do not originate from a stream device (in my case, they come from a DLL) and thus the encoding cannot be configured.
 proc UTF8FullCodes pBuffer {
    upvar $pBuffer Buffer
    # > Get last position index
    set LastPos [string length $Buffer]
    incr LastPos -1
    # > Current byte count of a multi-byte character, including the start byte
    set nBytes 1
    # Loop over bytes from the end
    for {set Pos $LastPos} {$Pos >= 0} {incr Pos -1} {
        set Code [scan [string index $Buffer $Pos] %c]
        if { $Code < 0x80 } {
            # > ASCII: everything up to the end is complete
            set Pos $LastPos
            break
        } elseif { $Code < 0xC0 } {
            # > Continuation byte (10xxxxxx) of a multi-byte sequence
            incr nBytes
        } else {
            # > First byte of a multi-byte sequence
            # Find number of required bytes by counting the leading 1 bits
            for {set Bytes 2} {$Bytes <= 6} {incr Bytes} {
                # > Check for zero at bit position (7 - Bytes)
                if {0 == ((1 << (7 - $Bytes)) & $Code)} {
                    break
                }
            }
            if { $Bytes == $nBytes } {
                # > Complete code: the whole buffer may be converted
                set Pos $LastPos
            } else {
                # > Incomplete code: exclude this sequence
                incr Pos -1
            }
            break
        }
    }
    # > Convert complete sequences up to Pos
    set Res [encoding convertfrom utf-8 [string range $Buffer 0 $Pos]]
    # > Keep the incomplete sequence in the buffer
    incr Pos
    set Buffer [string range $Buffer $Pos end]
    return $Res
 }

Lars H: A more compact way of checking that an UTF-8 octet sequence is a complete character is to use a regexp. This procedure checks whether a char is UTF-8:
 proc utf8 {char} {
    regexp {(?x)   # Expanded regexp syntax, so I can put in comments :-)
      [\x00-\x7F] |                # Single-byte chars (ASCII range)
      [\xC0-\xDF] [\x80-\xBF] |    # Two-byte chars (\u0080-\u07FF)
      [\xE0-\xEF] [\x80-\xBF]{2} | # Three-byte chars (\u0800-\uFFFF)
      [\xF0-\xF4] [\x80-\xBF]{3}   # Four-byte chars (U+10000-U+10FFFF, not supported by Tcl 8.5)
    } $char
 }

(This regexp can be tightened a bit if one wishes to exclude ill-formed UTF-8; see Section 3.9 ("Unicode Encoding Forms") of the Unicode standard.) See UTF-8 bit by bit for an explanation of how UTF-8 is constructed.

To match all complete UTF-8 chars at the beginning of a buffer use
    regexp {(?x)   # Expanded regexp syntax, so I can put in comments :-)
      \A (
      [\x00-\x7F] |                # Single-byte chars (ASCII range)
      [\xC0-\xDF] [\x80-\xBF] |    # Two-byte chars (\u0080-\u07FF)
      [\xE0-\xEF] [\x80-\xBF]{2} | # Three-byte chars (\u0800-\uFFFF)
      [\xF0-\xF4] [\x80-\xBF]{3}   # Four-byte chars (U+10000-U+10FFFF, not supported by Tcl 8.5)
      ) +
    } $buffer completeChars

[telgo] - 2010-07-23 21:23:34

Could I get one of you knowledgeable people to look at my question at Utf-8 difference between Windows and Mac? I am quite perplexed.

[crusherjoe] - 2013-10-31 02:33:34

If you need to include utf-8 characters in your script, then follow the instructions given in http://www.tcl.tk/doc/howto/i18n.html under "Sourcing Scripts in Different Encodings":
set fd [open "yourcode.tcl" r]
fconfigure $fd -encoding utf-8
set script [read $fd]
close $fd
eval $script

RS 2013-11-01 Easier: since Tcl 8.5, you can just write
source -encoding utf-8 yourcode.tcl

If you are using some other scripting language and you are communicating with tclsh/wish via stdio, then include the following at the start of the text sent to tclsh/wish:
fconfigure stdin -encoding utf-8
fconfigure stdout -encoding utf-8

From then on you can display utf-8 encoded text in tk widgets.

wiwo 2016-12-09

I wrote a little script for translating 4-byte UTF-8 chars to HTML entities.
proc utf8_encode_4byte_chars {s} {
    # info taken from https://de.wikipedia.org/wiki/UTF-8
    set result ""
    set chars_left 0

    foreach i [split $s ""] {
        # get the decimal representation
        scan $i %c c

        # If the binary representation starts with 11110, this is the lead
        # byte of a 4-byte char (the "1"s before the first "0" give the
        # number of bytes).
        if {$c >= 240 && $c <= 247} {
            # start of a 4-byte sequence
            set chars_left 4
        }
        if {$chars_left > 0} {
            if {$chars_left == 4} {
                # This is the first byte, which always starts with "11110".
                # Its last 3 bits contribute to the entity value.
                set bnum2 [expr { $c & 7 }]
            } else {
                # Following bytes always start with "10".
                # Their last 6 bits contribute to the entity value.
                set bnum2 [expr { ($bnum2 << 6) + ($c & 63) }]
            }
            if {$chars_left == 1} {
                # This is the last byte. We have gathered the full information.
                # Format it as a hex HTML entity.
                append result "&#x[format %04.4X $bnum2];"
            }
            incr chars_left -1
        } else {
            append result $i
        }
    }
    return $result
}

set x "... string with 4 byte entities, can't be entered in this wiki ..."
puts [utf8_encode_4byte_chars $x]

Sample output
# This is a smiley: &#x1F600;! Of course there are more of them, e.g. &#x1F60D; &#x1F60F;! Still more: &#x1F643; &#x1F640; &#x1F63E; &#x1F63D; &#x1F63C; &#x1F63B; &#x1F63A; &#x1F639; &#x1F638; &#x1F637; &#x1F636; &#x1F635;