Characters, glyphs, code-points, and byte-sequences

This page began as a posting by dkf to comp.lang.tcl recently. Readers are encouraged to add their own clarifications.

In article <[email protected]>, "Donal K. Fellows" <[email protected]> wrote:

 David Gravereaux wrote:
 > For me, I try to contort my thinking into pure glyphs, not an 8-bit number
 > that should mean a glyph.
 >
 > % puts \u00e0
 > à
 > (Tcl) 2 % puts \x85
 > ?
 > (Tcl) 3 % puts [encoding convertfrom cp437 \x85]
 > à
 > (Tcl) 4 % encoding system
 > cp1252

We really need to distinguish between characters, glyphs, code-points and byte-sequences here. A lower-case a with a grave accent is a character. The particular rendering of it on the screen (or on paper or wherever) with a particular font is a glyph. The code-point for a character is the number associated with that character in a character-set (the assignment could in theory be arbitrary, but is usually based on some fairly systematic scheme), and the byte-sequence for a character is how its code-point is represented in a file/channel.
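To make the distinction concrete, here is a minimal sketch (not part of the original posting) that takes one character, asks for its Unicode code-point, and then prints the byte-sequence it turns into under three encodings. The variable names are only illustrative, and the three encodings are assumed to be present in your Tcl installation, as they are in standard builds.

```tcl
# One character, written as an escape so the script file's own
# encoding does not matter.
set ch \u00e0                       ;# lower-case a with grave accent

# Code-point: the number that Unicode assigns to the character.
scan $ch %c codepoint
puts "code-point: $codepoint"       ;# code-point: 224

# Byte-sequences: how the same character is written out under
# different encodings.
foreach enc {iso8859-1 cp437 utf-8} {
    set bytes [encoding convertto $enc $ch]
    binary scan $bytes H* hex
    puts [format "%-10s %s" $enc $hex]
}
# iso8859-1  e0
# cp437      85
# utf-8      c3a0
```

The character never changes; only the bytes chosen to stand for it do.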

People perceive glyphs, and (typically) abstract from there to characters. Computers perceive byte-sequences and code-points, and can be instructed to go from there to characters via what is known in Tcl as an encoding.

Now, \x85 is a byte-sequence which corresponds to code-point 133 in some character set. From what you've said, it seems that it should be a character set that gives 133 the interpretation of a lower-case "a" with a grave accent, but that property alone does not pin down a single encoding (there's a family of encodings which all have it). I'd suggest either cp437 or cp850 as possible encodings.
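As an illustration (a sketch added here, not from the posting), the same single byte can be decoded under several encodings to see which character each one assigns to it. The choice of encodings to compare is just an assumption for the example.

```tcl
# The byte 0x85, interpreted under four different encodings.
set byte \x85
foreach enc {cp437 cp850 cp1252 iso8859-1} {
    set ch [encoding convertfrom $enc $byte]
    scan $ch %c cp
    puts [format "%-10s U+%04X" $enc $cp]
}
# cp437      U+00E0   (a with grave accent)
# cp850      U+00E0   (a with grave accent)
# cp1252     U+2026   (horizontal ellipsis)
# iso8859-1  U+0085   (a control character)
```

So the byte on its own says nothing about the character; the encoding supplies the meaning.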

There's more to this than what I've said, of course. There are some subtleties with naming, and the terminology is not used universally. Then there's the fact that Tcl's internal character-set is Unicode, because that allows for the representation of a very wide range of common characters from many human languages, but this may be stored as byte-sequences in two different forms (UCS-2, which uses two bytes per character, and UTF-8, which uses a variable number of bytes per character but can describe the ASCII subset using a single byte per character). This has a number of advantages; the key one for this discussion is that Tcl has a standard notion of what a character is that is not overly confused (at a script level) by how a character is represented.
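The variable-width nature of UTF-8 is easy to see from a script. The following is a small sketch with three sample characters picked for this example; it compares the length in characters with the length of the UTF-8 byte-sequence.

```tcl
# Each sample is one character at the script level, but takes a
# different number of bytes once encoded as UTF-8.
foreach {name ch} [list "a" a "a-grave" \u00e0 "euro sign" \u20ac] {
    set nchars [string length $ch]
    set nbytes [string length [encoding convertto utf-8 $ch]]
    puts [format "%-10s %d character, %d byte(s) in UTF-8" $name $nchars $nbytes]
}
# a          1 character, 1 byte(s) in UTF-8
# a-grave    1 character, 2 byte(s) in UTF-8
# euro sign  1 character, 3 byte(s) in UTF-8
```

At the script level [string length] always reports one character here, which is the point made above: the script does not need to care how the character is stored.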

The key to cutting through this thicket of confusion is to focus on two things. Firstly, describe accurately to Tcl what character to use; make sure that it knows you've got an a-grave and not something else. Secondly, make sure that characters get transferred between Tcl and the rest of the world (or vice versa) with the right encoding, which is typically done by using fconfigure (the encoding command is a fall-back in some circumstances).
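Both points can be sketched in a few lines. This example is an addition, not part of the posting; it assumes Tcl 8.6 or later for [file tempfile], and the scratch file and variable names are hypothetical. The character is given unambiguously as \u00e0, and the channel is told which encoding to use with fconfigure.

```tcl
# Write one character to a scratch file using the cp437 encoding.
set out [file tempfile path]
fconfigure $out -encoding cp437
puts -nonewline $out \u00e0
close $out

# Read the raw bytes back to see what was actually stored.
set in [open $path rb]
set raw [read $in]
close $in
binary scan $raw H* hex
puts "bytes on disk: $hex"          ;# bytes on disk: 85

# Read again, this time telling Tcl which encoding the file uses.
set in [open $path r]
fconfigure $in -encoding cp437
set text [read $in]
close $in
file delete $path
puts [string equal $text \u00e0]    ;# 1
```

Had the second read used a different encoding, the same byte would have come back as a different character.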

Here is how to find the encodings that map a given Unicode character to a given byte value:

 proc encoding'find {unicode value} {
     set res {}
     foreach encoding [encoding names] {
         if {[encoding convertto $encoding $unicode] == $value} {
             lappend res $encoding
         }
     }
     set res
 } ;# RS
 % encoding'find à [format %c 133]
 cp860 cp861 cp863 cp865 cp437 cp850 cp857

DKF - I'd prefer to express the above code as:

  proc encoding'find {characters codepoints} {
     set target {}
     foreach cp $codepoints {
        append target [format %c $cp]
     }
     set result {}
     foreach enc [encoding names] {
        if {[encoding convertto $enc $characters] eq $target} {
           lappend result $enc
        }
     }
     return $result
  }

  % encoding'find à 133
  cp860 cp861 cp863 cp865 cp437 cp850 cp857
  % encoding'find àà {133 133}
  cp860 cp861 cp863 cp865 cp437 cp850 cp857

Category Characters

Category Concept

Unicode and UTF-8