Updated 2011-12-18 19:59:52 by RLE

Purpose: The Tcl maintainers are working on an issue where overflow of a 32-bit integer, followed by conversion to wide, can yield a literal with an inappropriate internal representation. KBK has been asked to start a Wiki page to track the discussions, since they are getting complex enough to be difficult to track in the Chat and email, and a Wiki page supports collaborative editing, unlike a SourceForge bug report.

Background: Tcl 8.4 has a serious bug [1] in its processing of integer literals that exceed the range of a 32-bit word. For backward compatibility with earlier versions of Tcl, integers that fit into a 32-bit unsigned word are treated as 32-bit constants. The problem is that these constants acquire an internal representation that can later be sign-extended to a wide integer, and the resulting wide integer has an incorrect value. Bug [2] is related.

The bug is truly insidious because it pollutes shared literals, so unrelated code can stumble over problems. The following illustrates the sort of bizarre results the literal pollution can cause.
 % proc b {} { set x 2200000000 ; puts [expr { wide($x) + 1 }] }
 % b
 2200000001
 % proc a {} { clock format 2200000000 }
 % a; b
 -2094967295
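The arithmetic behind this can be reproduced in plain C, outside Tcl. The sketch below is a model, not Tcl's actual code path: the helper names are hypothetical, and it assumes only that the polluted literal holds the low 32 bits of 2200000000 as a signed 32-bit value, which `wide()` then sign-extends.

```c
#include <stdint.h>

/* Hypothetical helper: keep only the low 32 bits of a literal and read
 * them as a signed 32-bit value, as an "integer" internal rep would.
 * Written to avoid implementation-defined signed conversions. */
int32_t as_int32_rep(int64_t literal)
{
    int64_t low = (int64_t)(uint32_t)literal;   /* literal modulo 2**32 */
    if (low > INT32_MAX) {
        low -= INT64_C(0x100000000);            /* reinterpret bit 31 as sign */
    }
    return (int32_t)low;
}

/* Models [expr { wide($x) + 1 }] when $x already carries the 32-bit rep:
 * converting int32_t to int64_t preserves the (now negative) value. */
int64_t wide_plus_one_polluted(int64_t literal)
{
    return (int64_t)as_int32_rep(literal) + 1;
}

/* Models [expr { wide($x) + 1 }] when $x is parsed directly to wide. */
int64_t wide_plus_one_clean(int64_t literal)
{
    return literal + 1;
}
```

With these helpers, `wide_plus_one_clean(2200000000)` gives 2200000001 and `wide_plus_one_polluted(2200000000)` gives -2094967295, matching the two outputs of `b` above.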

Changes to [string is integer] in 8.4.3 and later releases make the following script fail in the same way, although it worked in releases 8.0 through 8.4.2:
 % proc b {} { set x 2200000000 ; puts [expr { wide($x) + 1 }] }
 % b
 2200000001
 % proc a {} { string is integer 2200000000 }
 % a; b
 -2094967295

What's happening: The following table gives examples of each type of 32-bit integer conversion that's possible, and the notes below explain what is going on.
 -------------------------------------------------------------
 Constant       32-bit representation                     Note
 -------------------------------------------------------------
 -0x100000000   -- integer value too large to represent -- *1
 -------------------------------------------------------------
 -0xffffffff    0x1                                        *2
 -0x80000001    0x7fffffff                                 *2
 -------------------------------------------------------------
 -0x80000000    0x80000000                                 *3
 -0x7fffffff    0x80000001                                 *3
 -0x1           0xffffffff                                 *3
 -------------------------------------------------------------
 -0x0           0x0                                        *4
  0x0           0x0                                        *4
  0x1           0x1                                        *4
  0x7fffffff    0x7fffffff                                 *4
 -------------------------------------------------------------
  0x80000000    0x80000000                                 *5
  0xffffffff    0xffffffff                                 *5
 -------------------------------------------------------------
  0x100000000   -- integer value too large to represent -- *6
 -------------------------------------------------------------

  • 1 - Numbers less than -2**32+1 cannot be represented as 32-bit integers. Tcl rejects these numbers in any context requiring a 32-bit integer. In 8.4 and beyond, these numbers can have a "wide" internal representation as a 64-bit number.
  • 2 - Numbers between -2**32+1 and -2**31-1 cannot be represented as 32-bit signed integers, but we suspect that they appear occasionally in scripts that intend to treat them as 32-bit constants (-0xffffffff is an example). They are handled by converting the absolute value to an unsigned 32-bit integer and then taking the two's complement of the result. This is a case where the "wide" and "integer" internal representations disagree.
  • 3 - Numbers between -2**31 and -1 are negative signed integers that fit conveniently in a 32-bit signed word. They do not cause trouble with conversion between "integer" and "wide."
  • 4 - Numbers between 0 and 2**31-1 are positive signed integers that fit conveniently in a 32-bit signed word. They do not cause trouble with conversion between "integer" and "wide."
  • 5 - Numbers between 2**31 and 2**32-1 are positive integers that require a 32-bit unsigned word to represent. Tcl 8.x converts them to an "integer" internal representation. This is a case where the "wide" and "integer" internal representations disagree.
  • 6 - Numbers greater than or equal to 2**32 are positive integers that do not fit in a 32-bit word. Tcl rejects them in a context where 32-bit integers are required, but will convert them to "wide."
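The six notes can be summarized as a small classifier. This is a hypothetical C sketch of the rules as described above, not code from tclObj.c; the function name is invented for illustration. It returns the note number and, for notes 2 through 5, the 32-bit pattern shown in the table.

```c
#include <stdint.h>

/* Hypothetical classifier for the notes above. Returns the note number
 * (1..6); for notes 2..5 it also stores the 32-bit pattern in *bits.
 * -0x0 and 0x0 both arrive here as 0 and land in note 4. */
int classify_int32(int64_t value, uint32_t *bits)
{
    if (value < -INT64_C(0xffffffff)) {
        return 1;                        /* too negative for 32 bits */
    }
    if (value > INT64_C(0xffffffff)) {
        return 6;                        /* too positive for 32 bits */
    }
    if (value < 0) {
        /* Take the magnitude as an unsigned 32-bit word, then
         * two's-complement it (notes 2 and 3). */
        uint32_t magnitude = (uint32_t)(-value);
        *bits = ~magnitude + 1u;
    } else {
        *bits = (uint32_t)value;         /* notes 4 and 5 */
    }
    if (value <= -INT64_C(0x80000001)) {
        return 2;                        /* negative, overflowed */
    }
    if (value < 0) {
        return 3;                        /* fits a signed 32-bit word */
    }
    return value <= INT32_MAX ? 4 : 5;   /* 5 needs an unsigned word */
}
```

Applied to each constant in the table, it reproduces the 32-bit representation column; for example, -0xffffffff yields note 2 with the pattern 0x1.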

The significant cases above are 2 (numbers between -0x80000001 and -0xffffffff) and 5 (numbers between 0x80000000 and 0xffffffff). In both cases, the "integer" internal representation, if sign-extended to "wide", yields an incorrect value. In both cases, the "wide" value is correct if the upper 32 bits are filled with the complement of the sign bit (bit 31) instead of with the sign bit itself, leaving the low 32 bits unchanged. The following table gives examples for each case; the "Incorrect" column is blank where ordinary sign extension already gives the correct value.
 -------------------------------------------------------------------
 Constant       Incorrect               Correct                 Note
 -------------------------------------------------------------------
 -0xffffffff    0x0000000000000001      0xffffffff00000001       *2
 -0x80000001    0x000000007fffffff      0xffffffff7fffffff       *2
 -0x80000000                            0xffffffff80000000       *3
 -0x00000001                            0xffffffffffffffff       *3
 -0x00000000                            0x0000000000000000       *4
  0x00000000                            0x0000000000000000       *4
  0x00000001                            0x0000000000000001       *4
  0x7fffffff                            0x000000007fffffff       *4
  0x80000000    0xffffffff80000000      0x0000000080000000       *5
  0xffffffff    0xffffffffffffffff      0x00000000ffffffff       *5
 -------------------------------------------------------------------
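The difference between the two columns comes down to which bit fills the upper half of the wide value. The following C sketch (function names hypothetical) returns 64-bit bit patterns, so that it matches the hex in the table and avoids implementation-defined signed conversions.

```c
#include <stdint.h>

/* Put the 32-bit representation in the low half and fill the upper
 * half with ones or zeros. */
static uint64_t extend(uint32_t bits, int fill_ones)
{
    uint64_t upper = fill_ones ? UINT64_C(0xffffffff00000000) : 0;
    return upper | bits;
}

/* Ordinary sign extension: the upper half copies bit 31.
 * This gives the "Incorrect" column for cases 2 and 5. */
uint64_t widen_plain(uint32_t bits)
{
    return extend(bits, (bits >> 31) != 0);
}

/* Proposed treatment for cases 2 and 5: the upper half takes the
 * complement of bit 31; the low 32 bits are left untouched. */
uint64_t widen_overflowed(uint32_t bits)
{
    return extend(bits, (bits >> 31) == 0);
}
```

For cases 3 and 4, `widen_plain` already produces the "Correct" column, which is why only cases 2 and 5 need special handling.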

So, how to fix this bug? KBK's initial thought is to introduce another object type in tclObj.c: tclOverflowedIntType. This object type will represent objects that were converted on input to 32-bit integers with overflow. It will behave identically to tclIntType in that Tcl_GetIntFromObj will return the same 32-bit value. However, the places in the Core that retrieve an integer representation and then sign-extend it to wide will, for this type, fill the upper 32 bits with the complement of the sign bit, as in cases 2 and 5 above.
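To make the proposal concrete, here is a toy model in C. `MockType` and `MockObj` are hypothetical stand-ins invented for this sketch; they are not Tcl's `Tcl_ObjType` or `Tcl_Obj`, and a real change would live in tclObj.c and at the Core's widening call sites. What it illustrates is that the two types answer a 32-bit request identically and differ only when widened.

```c
#include <stdint.h>

/* Hypothetical stand-ins for tclIntType and the proposed
 * tclOverflowedIntType; these are NOT Tcl's real structures. */
typedef enum { MOCK_INT, MOCK_OVERFLOWED_INT } MockType;

typedef struct {
    MockType type;
    uint32_t bits;   /* the 32-bit value; plays the role of longValue */
} MockObj;

/* Convert a literal on input. Returns 0 if it does not fit in 32 bits
 * at all (cases 1 and 6); tags cases 2 and 5 as overflowed. */
int mock_from_literal(int64_t value, MockObj *obj)
{
    if (value < -INT64_C(0xffffffff) || value > INT64_C(0xffffffff)) {
        return 0;
    }
    obj->bits = (uint32_t)value;                     /* value modulo 2**32 */
    obj->type = (value < INT32_MIN || value > INT32_MAX)
                ? MOCK_OVERFLOWED_INT : MOCK_INT;
    return 1;
}

/* Both types answer a request for the 32-bit value identically. */
uint32_t mock_get_int(const MockObj *obj)
{
    return obj->bits;
}

/* Only widening looks at the type: an overflowed integer fills the
 * upper half with the complement of the sign bit. */
uint64_t mock_get_wide(const MockObj *obj)
{
    int sign = (obj->bits >> 31) != 0;
    int fill_ones = (obj->type == MOCK_OVERFLOWED_INT) ? !sign : sign;
    uint64_t upper = fill_ones ? UINT64_C(0xffffffff00000000) : 0;
    return upper | obj->bits;
}
```

In this model a literal such as 2200000000 is tagged as overflowed on input, so `mock_get_int` still returns the familiar 32-bit pattern while `mock_get_wide` returns 2200000000 rather than the sign-extended pattern for -2094967296.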

It is KBK's belief that this change should not break existing scripts, since they will see the same 32-bit behavior that they did before. It should also not break existing extensions, even those that reach into the internal representation; the worst effect should be that such extensions make needless type-conversion calls.

One remaining issue with this idea is what Tcl_ConvertToType should do if asked to convert one of these overflowed integers to tclIntType. KBK believes that the most backward-compatible action is probably to have it silently convert to tclOverflowedIntType instead; any code that expects the internal representation afterward will see the correct data in objPtr->internalRep.longValue and will notice the difference only if it explicitly checks objPtr->typePtr. A riskier alternative is to return TCL_ERROR with a message indicating that the value is too large to represent. KBK believes that both alternatives are low-risk, because there are few if any callers of Tcl_ConvertToType; no Core caller ever requests an explicit integer conversion in this manner.

Jacob Levy: KBK's tcl-core message sounded more alarming than this page does. For starters, where in the core did this bug manifest itself? And does it affect only TclInt, or is it, as DKF hints, a matter of setting the object's type incorrectly?

In any case, I'm hoping that any fix for this will not break existing extensions that use the core's internals such as Feather, Jacl, and e4Graph.

DKF: Rummaging around in Tcl's guts is not a great thing to do. Only the owner code of a type should look at the internal rep of objects of that type.

Jacob Levy: DKF, I'm not sure what the above means; please explain.

DGP: Here's a simple example of what DKF means. Say you've been handed a Tcl_Obj and you want to store its integer value in a C variable of type int. The right way to do that is:
  if (Tcl_GetIntFromObj(interp, objPtr, &value) != TCL_OK) return TCL_ERROR;

The wrong way to do that is:
  Tcl_ConvertToType(interp, objPtr, Tcl_GetObjType("int"));
  value = (int) objPtr->internalRep.longValue;

The idea is that Tcl knows how it stores internal representations of integers. Let it pull out the value for you. Don't try to do Tcl's work for it (and in the process create future breakage if Tcl ever changes its mind).

Snipped the rest, which was veering into a different topic.

Lars H: IMHO, these bugs demonstrate that 32-bit integer overflow is alien to the spirit of Tcl. Integers in Tcl should be proper integers (the mathematical set Z) rather than some standard C datatype. The page Proper integers for Tcl 8.5 or 9.0 sketches how that could be achieved (comments welcome).

HTL: Are the bug IDs in the background section correct? The first bug was closed in 2004, and the link to the second leads to an error page.