Updated 2018-08-17 13:19:04 by pooryorick

lsort, a built-in Tcl command, sorts elements in a list.

Synopsis  edit

lsort ?options...? list

Documentation  edit

official reference
TIP 326, Add -stride Option to lsort
New in version 8.6.
TIP 241, Case-Insensitive Switches and List Searching and Sorting
New in version 8.5
TIP 217, Getting Sorted Indices out of Lsort
New in version 8.5.

Description  edit

lsort returns a new list in sorted order. It uses a the merge-sort algorithm, which is a stable sort that has O(n log n) performance characteristics. By default, sorting is done in ASCII ascending order. The following options may be specified to control the sorting process:
-ascii
Use string comparison with ASCII collation order. This is the default.
-command command
Use command as a comparison command. To compare two elements, evaluate a Tcl script consisting of command with the two elements appended as additional arguments. The script should return a 32-bit signed integer less than, equal to, or greater than zero if the first element is to be considered less than, equal to, or greater than the second, respectively.
-decreasing
Sort the list in decreasing order ("largest" items first).
-dictionary
Use dictionary-style comparison. This is the same as -ascii except (a) case is ignored except as a tie-breaker and (b) if two strings contain embedded numbers, the numbers compare as integers, not characters. For example, in -dictionary mode, bigBoy sorts between bigbang and bigboy, and x10y sorts between x9y and x11y.
-increasing
Sort the list in increasing order ("smallest" items first). This is the default.
-index index
If this option is specified, each of the elements of list must itself be a well-formed list. Instead of sorting based on whole sublists, lsort extracts the index'th element from each sublist and sorts based on the that element. The keyword end is allowed for the index to sort on the last sublist element. For example,
-integer
Convert list elements to integers and use integer comparison.
-real
Convert list elements to floating-point values and use floating comparison.
lsort -integer -index 1 {{First 24} {Second 18} {Third 30}}

returns {Second 18} {First 24} {Third 30}. This option is much more efficient than using -command to achieve the same effect. (From: TclHelp)
-stride groupcount
interpret the list groups of lists, each list having groupcount elements. This saves the trouble of constructing such a list from a flat list when such semantics are needed.
-unique
If this option is specified, then only the last set of duplicate elements found in the list will be retained. Note that duplicates are determined relative to the comparison used in the sort. Thus if -index 0 is used, {1 a} and {1 b} would be considered duplicates and only the second element, {1 b}, would be retained. (From: TclHelp)

lsort -dictionary warning edit

RS: WARNING: lsort -dictionary can not be used for sensibly sorting signed integers. It returns first the negatives, then zero and the positives, but both in ascending absolute value order:
% lsort -dic {-5 -10  -1 0 3 1 2}
-1 -5 -10 0 1 2 3

So for a numeric maximum function, use the numeric qualifiers -integer or -real (which does decimal integers as well, just might take some longer).

lsort -real cannot sort non-decimal (i.e. hex or octal) integers, and lsort -integer cannot sort any number containing a decimal point or exponent (which cannot, in any case, be mixed with a hex specifier.)

willdye: Perhaps this behavior comes from trying to accomodate strings which use dashes as a general separator. For example, here's a list of dates in the widely-used [ISO8601] (YYYY-MM-DD) format:
% lsort -dictionary {2005-01-01 2005-01-02 2005-01-03 2005-01-04}
2005-01-01 2005-01-02 2005-01-03 2005-01-04

(See Parsing ISO8601 dates and times if you want details about the format). I agree with RS, however, that the list -5 -10 -1 0 3 1 2 should ideally sort to -10 -5 -1 0 1 2 3. Fortunately, the behavior is easy to work around.

lsort performance  edit

KBK 2001-02-14:

One thing to remember when using lsort is that the performance of lsort -command is pretty awful. If you can pre-convert your list into something that can be sorted with another form of lsort, it's almost always faster to do so.

Example: Over in Things British, RS posted code to sort a list of surnames according to the rules for a British telephone directory. The code worked something like:
proc comp.uk.phone {x y} {
    regsub {Ma?c ?([A-Z])} $x {M_\1} x
    regsub {Ma?c ?([A-Z])} $y {M_\1} y
    string compare $x $y
}

proc sort.uk.phone { list } {
    return [lsort -command comp.uk.phone $list]
}

It's a good bit faster to sort the list by making a list of sort key and value, sorting this list with lsort -index, and then extracting just the values, thus:
proc sort.uk.phone2 { list } {
    foreach name $list {
        regsub {Ma?c ?([A-Z])} $name {M_\1} key
        lappend list2 [list $key $name]
    }
    foreach pair [lsort -index 0 -ascii $list2] {
        lappend list3 [lindex $pair 1]
    }
    return $list3
}

OK, let's run the two methods head to head:
set list {MacDonald McArthur McEwan Lyttle Mabbs Jones}
set l 6
puts " Length | Method 1 | Method 2 | Ratio (m1 / m2)"
puts " -------+----------+----------+----------------"
while {$l <= 25000} {
    if {$l > 100} {
        set count 1
    } else {
        set count 100
    }
    set m1 [lindex [time {sort.uk.phone $list} $count] 0]
    set m2 [lindex [time {sort.uk.phone2 $list} $count] 0]
    set ratio [expr {double($m1) / double($m2)}]
    puts [format " %6d | %8d | %8d | %9.2f" $l $m1 $m2 $ratio]
    incr l $l
    lappend list {*}$list
}

The results are shown below:
Length | Method 1 | Method 2 | Ratio (m1 / m2)
-------+----------+----------+----------------
     6 |     1410 |      500 |      2.82
    12 |     4200 |     1000 |      4.20
    24 |    10820 |     1900 |      5.69
    48 |    27140 |     3710 |      7.32
    96 |    65890 |     7310 |      9.01
   192 |   160000 |    10000 |     16.00
   384 |   361000 |    30000 |     12.03
   768 |   771000 |    60000 |     12.85
  1536 |  1783000 |   120000 |     14.86
  3072 |  3805000 |   251000 |     15.16
  6144 |  8352000 |   571000 |     14.63
 12288 | 18166000 |  1061000 |     17.12
 24576 | 39077000 |  2153000 |     18.15

DKF This technique is (I believe) known as creating a collation key. Also note that this comparison is not entirely fair, as it is performing additional regsub'''s in the -command case, though it is easy to see from a comparison of lsort $someList with lsort -command {string compare} $someList that it is still painful to use -command when you don't have to.

- See Custom sorting on a wrapper for collation key sorting.

Sorting on multiple indexes edit

How does one sort a multi-index? By that is meant something like, "sort by last name, then first name, then age". lsort doesn't offer direct support for multi-indexes. The straightforward approach, therefore, is to use -command. However, Bruce Hartweg wisely observes that, as lsort sorts stably, an efficient multi-index sort is achieved by cascading lsort applications--least significant first, of course. See Additional list functions for multisort which packages such cascaded sorts into one call.

In the same vein, Andreas Kupries wrote on 2001-03-21 in comp.lang.tcl:

The lsort command uses a 'stable' mergesort. This means that it is possible to sort something after more than one key and the later sorts will not destroy the order of the previous sorts/keys. Quicksort for example is not stable (DKF: it also has some rotten worst-case behaviour, unlike mergesort.)

So, just sort after the secondary key first and then after the primary key. like so:
set f {{12 11} {12 13} {12 12} {11 14} {13 12}}

puts [lsort -index 1 $f]
puts [lsort -index 0 [lsort -index 1 $f]]
{12 11} {12 12} {13 12} {12 13} {11 14}
{11 14} {12 11} {12 12} {12 13} {13 12}

DKF: If you are doing this on a lot of data, it is far more efficient to convert the keys into an additional collation key (as described above in relation to performance-measuring) and sort on that, as this adds two O(n) (i.e. linear) operations to generate and strip the collation keys and removes an O(n log n) sort.

KBK: Custom sorting has a worked example of doing what DKF recommends above.

JRW: Both approaches are asymptotically equivalent. If you are trying to sort records on k fields, then either you do one sort where each sort comparison involves comparing k fields, or you do k (stable) sort passes where each comparison involves one field. Either way, k n log n comparisons. (KBK But the constant factor for lsort -command is horrible, see above. Do everything that you can to avoid using lsort -command, and you will be happier in the long run.)

lsort -command  edit

RS:

One thing lsort -command is good for is to introspect how lsort works. Consider this silly sort function:
proc foo {a b} {puts [list $a $b]; return 0}

which constantly says its inputs are equal, but logs them to stdout:
lsort -command foo {1 2 3 4 5 6 7 8 9 10}
1 2
3 4
1 3
2 3
5 6
7 8
5 7
6 7
1 5
2 5
3 5
4 5
9 10
1 9
2 9
3 9
4 9
5 9
6 9
7 9
8 9

Makes it pretty clear how a merge sort works, doesn't it?

FW: Why is -command so slow, anyway?

NEM 2009-03-22: I profiled this at one point. The overhead comes partly from calling a command: looking it up, checking for any command/execution traces, etc. You can eliminate some of this, but not all. Another big killer is simply having to pack/unpack the comparison result (-1,0,1) into a Tcl_Obj. That puts some allocs/frees into the critical loop. Making this run fast would probably require some heavy-duty re-engineering of large parts of how Tcl deals with values and commands.

lsort and unicode  edit

bll 2016-1-10: See collation for information on how to sort in the locale's collation sequence.

LV 2003-06-02:

Does lsort support l10n considerations - such as sorting \u0063 equivalent to \u00e7 - at least in a French locale?

RS: No, sorry. One would have to introduce custom sorting explicitly depending on requirements (see e.g. Things British for British phone-book sorting).

HolgerJ 2003-10-13: I'm still amazed by the fact that lsort -dictionary does not sort by Unicode dictionary order. I know that the algorithm is quite complex and language dependent (one would have to read the environment to find out the language), but at least the most common things could be considered - like Umlauts and French accents. lsort -dictionary should be renamed lsort -ignorecase or the like.

So, if you want to sort according to dictionary rules, you have to supply your own strcmp-like function or re-implement the whole lsort command. The latter might be faster, but not so convenient.

RS Is there a Unicode dictionary order, except for the numeric codes (which lsort does)? As far as I know, different languages have different collation orders, e.g. "Ä" might be sorted just as A, or as AE, in German, while in Swedish it sorts after Å... which sorts after Z...

HolgerJ 2003-10-19: No, see above. "Ä" does principally sort like "A" in German, but the trema (dots) are used as a tiebreaker. This is called secondary sorting order. For details see http://www.unicode.org/unicode/reports/tr10/ Quotation: 'This is worth emphasizing: the sorting weight of characters is not provided by their position in the Unicode code charts. ' Hopefully, this will be implemented in C the Tcl core. In Tcl it would bee too slow, I think.

The following idea of Lars H. looks fine, but lacks support for the secondary (and tertiary) order.

Lars H: Whereas I agree the Unicode support in Tcl is incomplete, I don't agree on the particular example (dated 2003-10-13 by Holger J above). The Unicode Standard [1] is littered with remarks that issues such as this should not be handled by Unicode as such, but by a higher level mechanism. One could have an option to lsort that specifies a command that should be applied to the sort keys before they are compared, or perhaps a string map style list to normalize the characters, but the Tcl core has no business implementing language-specific sorting as such. Also, -dictionary does far more than the old (Tcl7) -ignore option of lsort did (e.g., it places a2 before a10).

Currently, this is what you should do to get language-specific sorting
set temp_list [list]
foreach item $list_to_sort {
    lappend temp_list [list $item [string map $collateList $item]]
    # $collateList is what handles language-dependence 
}
set sorted_list [list]
foreach item [lsort -dictionary -index 1 $temp_list] {
    lappend sorted_list [lindex $item 0]
}

$sorted_list is now sorted in the custom order. For Swedish, the collateList should probably be something on the lines of
set collateList {å \x80 Å \x80 ä \x81 Ä \x81 ö \x82 Ö \x82 W V w v é e É E}

I need to do something special for ÅÄÖåäö, because the native order in latin-1 (and hence in Unicode) is wrong for Swedish.

What I miss most in Tcl's Unicode support is some way of normalizing strings (at least to some of the standard normalisation forms defined in the standard and its annexes). decomposed and precomposed subcommands of string appear to me as the most obvious way to do this.

LV: If someone is looking for a project, research http://www.unicode.org/reports/tr10/tr10-10.html (or whatever variation is the latest of the document) - this is a technical note by the Unicode folks on the fun of collation of unicode strings. A Tcl extension providing an interface to the algorithm would be a wonderful contribution.

lsort in-place reference  edit

AMG: If you want to modify a sorted list in-place and keep the result sorted, check out Sorted Lists for alternatives to repeatedly calling lsort.

lsort merge edit

KBK 2005-03-29 in the Tcl chatroom: Since Tcl's merge exploits natural runs, you can also merge two lists without an extra merge command to do it:
lsort [concat $list1 $list2]

works in O(n) time if $list1 and $list2 are already sorted.

multisort  edit

MG 2005-07-29: I have a list, where each element is itself a two-element list, and I want to sort based on both elements, one after the other. It seems that the -index option to lsort can only take a single index, though - I'd been thinking of trying something like this
% lsort -index {1 0} [list {f 5} {a 5} {b 3}]
{b 3} {a 5} {f 5}

but it raises an error. Should I just use a custom proc and the -command option, or is there a better way that I'm not aware of? Thanks in advance for your help :)

Peter Newman 2005-07-29: Search for multisort above in this page and in Additional list functions.

-integer and values that are not integers  edit

Note that before Tcl 8.5, some argument combinations were not correct, but were not reported as an error. For example
lsort -integer numlist

which is a mistake in a couple of ways, generates an error now where in the past the fact that no integer arguments were provided was not raised as an error.

Sort by string length  edit

AMG: Here's how to sort a list by ascending string length. It demonstrates the use of [apply] for lambdas.
set data {this is tcl}
lsort -command {apply {{a b} {expr {[string length $a] - [string length $b]}}}} $data
# returns "is tcl this"

lsort -unique -indices  edit

AMG: The Tcl 8.6b2 man page for lsort doesn't clearly document what exactly happens when -unique and -indices are used together. Here's how it works: the index of the last duplicate element is returned. The man page does make mention of this behavior in connection to custom comparisons (the example used is -index 0), but you have to extrapolate from there to guess that lsort returns the index of the last duplicate when -indices is used. When neither a custom comparison nor -indices are used, it completely does not matter which duplicate is returned.

See Also  edit

Custom sorting
sort using a collation key
list
Topological sort
How fast can we sort 5 elements
lsort index vector
Forward-compatible lsort
add support for -stride to Tcl 8.5