Updated 2017-09-27 18:35:59 by kpv

This page describes various string processing that exist beside the standard commands

Commands  edit

namedescription
adjust
tcllib::textutil::adjust
tcllib::textutil::blank
tcllib::textutil::cap
tcllib::textutil::chop
tcllib::textutil::getPredefined
tcllib::textutil::indent
tcllib::textutil::listPredefined
tcllib::textutil::longestCommonPrefix
tcllib::textutil::longestCommonPrefixList
tcllib::textutil::readPatterns
tcllib::textutil::repeat
tcllib::textutil::splitn
tcllib::textutil::splitx
tcllib::textutil::tabify
tcllib::textutil::tabify2
tcllib::textutil::tail
tcllib::textutil::trim
tcllib::textutil::trimEmptyHeading
tcllib::textutil::trimleft
tcllib::textutil::trimprefix
tcllib::textutil::trimrigth
tcllib::textutil::uncap
tcllib::textutil::undent
tcllib::textutil::untabify
tcllib::textutil::untabify2
Poolby Andreas Kupries, has various string functions, some of which are probably now included in tcllib::textutil
stringDistancemeasure the similarity between two strings
findOverlapif andhow many of the last lines in $str1 match the first lines in $str2
countcount characters in a string]
case:titleCreate WP formatted title strings on the active selection
Longest common substring
ycl::string::cmpLike the Unix cmp command, compares two strings and returns the index at which they differ, or -1 if they are identical.
ycl::string::delimitsplit (partition) a string into substrings using any combination of string, match, or regular expressions, returning both the substrings and the delimiters.
ycl::string::pstringconverts non-printing characters in a string to their hexadecimal escape sequences
ycl::string::shortmatchlike [string match], but returns the index of the last character of the shortest match, or -1
ycl::tcl::armourConvert a value into a printing string that can be pasted directly into a Tcl script as a value

ASCII map  edit

No algorithm at all, but may come in handy ;-)

If you're on a UNIX box, try: man ascii
proc ascii {} {return {
    00 nul  01 soh  02 stx  03 etx  04 eot  05 enq  06 ack  07 bel
    08 bs   09 ht   0a nl   0b vt   0c np   0d cr   0e so   0f si
    10 dle  11 dc1  12 dc2  13 dc3  14 dc4  15 nak  16 syn  17 etb
    18 can  19 em   1a sub  1b esc  1c fs   1d gs   1e rs   1f us
    20 sp   21  !   22  "   23  #   24  $   25  %   26  &   27  '
    28  (   29  )   2a  *   2b  +   2c  ,   2d  -   2e  .   2f  /
    30  0   31  1   32  2   33  3   34  4   35  5   36  6   37  7
    38  8   39  9   3a  :   3b  ;   3c  <   3d  =   3e  >   3f  ?
    40  @   41  A   42  B   43  C   44  D   45  E   46  F   47  G
    48  H   49  I   4a  J   4b  K   4c  L   4d  M   4e  N   4f  O
    50  P   51  Q   52  R   53  S   54  T   55  U   56  V   57  W
    58  X   59  Y   5a  Z   5b  [   5c  \   5d  ]   5e  ^   5f  _
    60  `   61  a   62  b   63  c   64  d   65  e   66  f   67  g
    68  h   69  i   6a  j   6b  k   6c  l   6d  m   6e  n   6f  o
    70  p   71  q   72  r   73  s   74  t   75  u   76  v   77  w
    78  x   79  y   7a  z   7b  {   7c  |   7d  }   7e  ~   7f del
}} ;#RS

Linebreak  edit

Break a string into lines of specified maximum length: join the output of this with \n for hard-wrapped text:
proc linebreak {s {width 80}} {
   set res {}
   while {[string length $s]>$width} {
       set     pos [string wordstart $s $width]
       lappend res [string range     $s 0    [expr $pos-1]]
       set     s   [string range     $s $pos end]
   }
   lappend res $s
} ;# RS

Arjen Markus An elegant solution, but two remarks:

  • The file word.tcl remarks that word boundaries are platform-dependent. Is this also used in string wordstart?
  • More importantly: The above fails if a word is longer than the given width! You get into an endless loop.

word Wrap Via Tk Text Widget  edit

D. McC: A problem with string wordstart, if you want hard-wrapped text that reliably looks right, is this (from the string wordstart page on the Wiki): "A word is considered to be any contiguous range of alphanumeric (Unicode letters or decimal digits) or underscore (Unicode connector punctuation) characters, or any single character other than these" (emphasis added). So, if you have a word (in the ordinary sense) preceded or followed by punctuation marks, string wordstart will treat the punctuation marks as separate "words," and they may not come out on the same line as the word they go with!

The Tk text widget, with word wrap on, displays wrapped text that reliably looks right, with no punctuation marks separated from the adjoining words. So, to get hard-wrapped text that looks right, you can determine the locations where the text widget ends the wrapped lines, and replicate the text with newlines at those locations. Here's some code I wrote that does this (feel free to suggest improvements). A global variable "formawid" is set to hold the desired width in characters of the wrapped text, and the text widget (here, ".tx") is configured to that width, with word wrap:
.tx configure -width $formawid -wrap $wordwrap
wm geometry . {} ; # Make sure the toplevel shrinks or expands to fit

Then, after a delay of one-tenth of a second (which seems to be needed for some reason), the following procedure is run. The "whattodo" arg can be either "print" (to prepare hard-wrapped text for printing) or "show" (to display hard-wrapped text in the text widget). If there is a selection, only the selected text will be hard-wrapped; if there isn't, all text in the widget will be hard-wrapped.
proc formanew {whattodo} {
    global formawid
    # Identify beginning and end of text to format:
    if {[.tx tag ranges sel] eq {}} {
        set selon 1.0
        set seloff [.tx index end]
    } else {
        set selon [.tx index sel.first]
        set seloff [.tx index sel.last]
    }
    set texin [expr int($selon)]
    set texend [expr int($seloff)]
    # Initialize variable to hold output:
    set formatext {} 
    for {set i $texin} {$i <= $texend} {incr i} {
        # Get text to newline:
        set endolin [.tx index $i.end]
        set endochar [lindex [split $endolin .] end]
        set whatline [.tx get $i.0 $endolin]
        # If line is blank, insert only newline into output:
        if {[string trim $whatline] eq {}} {
            append formatext \n
            continue
        }
        # If not, then find out where line is wrapped:
        for {set c 1} {$c <= $endochar} {incr c} {
            .tx see $i.$c
            set ceemin [expr {$c-1}]
            set boxie [.tx get $i.$ceemin]
            # Get y coordinates of bounding boxes for adjoining characters:
            set pixy [lindex [.tx bbox $i.$ceemin] 1]
            set nexy [lindex [.tx bbox $i.$c] 1]
            # If y coordinate of bounding box is greater than for
            # preceding character, line has been wrapped, so
            # insert preceding character plus newline into output:
            if {$nexy > $pixy} {
                append formatext $boxie\n
                .tx see $i.$c
            } else {
                # Otherwise, insert only the preceding character:
                append formatext $boxie
            }
        }
        # Replicate existing newline from text widget:
        if {$i < $texend} {
            append formatext \n
        }
    }
    if {$whattodo eq {print}} {
        return $formatext
    } else {
        .tx delete $selon $seloff
        .tx insert $selon $formatext
        .tx edit separator
    }
}

after 100 formanew

hv If you want to format the lines, look into textutil::adjust. It can do left-, right-, or center-justify, adjust the text width, and do hyphenation.

Occurrence Count  edit

Count number of occurrences of a substring in a string:
proc scount {subs string} {
    regsub -all $subs $string $subs string
}
proc scount2 {subs string} {
    regexp -all $subs $string ;# 8.3
}

The latter can also be defined by currying (see Custom curry):
interp alias {} scount3 {} regexp -all

JH writes, in a comp.lang.tcl thread talking about this subject
proc scount4 {subs string} {
    regexp -all ***=$subs $string
} 

and AM initially proposes:
proc scount5 {subs string} {
    set count [llength [split [string map [list $subs \uFFFF]  $string] \uFFFF]]
    incr count -1 
}

but that doesn't make use of $string, so something is missing...

Stephane A. writes:
proc countstrings {data search} {
    set l [string length $search]
    set count 0
    while {[set i [string first $search $data]]>=0} {
        incr count
        incr i $l
        set data [string range $data $i end]
    }
    set count
} 

Reverse a String  edit

proc sreverse s {join [lreverse [split $s {}]] {}} ;#RS

where lreverse is of course on Additional list functions...

FW notes: this goes through a transitional list form. Still being a CPU speed and memory junkie, I feel obliged to correct this :P I'll call it sreverse:
proc sreverse s {
    set res {} 
    set i [string length $s]
    while {$i >= 0} {
        set res "$res[string index $s [incr i -1]]"
    }
    return $res
}

RS agrees, except that append may be more efficient:
proc sreverse s {
    set l [string length $s]
    set res {} 
    while {$l} {append res [string index $s [incr l -1]]}
    set res
}

LV notes that two more solutions recently appeared on comp.lang.tcl: The first is by Michael Schlenker:
proc string_reverse str {
    set rts {} 
    for {set i [string length $str]; incr i -1} {$i >= 0} {incr i -1} {
        append rts [string index $str $i]
    }
    return $rts
}

and the second by [(R. T. Wurth]:
proc srev s {
    return [join [::struct::list reverse [split $s {}]] {}]
}

Am 2005-12-13: The question came up again in the chatroom yesterday, and I decided to measure the performance of various alternatives ... See: Performance of string reversing algorithms

Longest Common Prefix  edit

Also found in tcllib::textutil

Arjen Markus At times I have had the need (not too urgent though) of a function/proc to determine the position where two strings (or lists) become different:
   "Arjen Markus" versus "Arjen Marcus" --> position: 9
   "Arjen Markus" versus "Arjan Markus" --> position: 3

(though mostly these are cases where my name and the variations that I commonly encounter are not involved) - RS: That would be the length of the common prefix which one might implement like this:
proc commonPrefix {a b} {
    set res {}
    foreach i [split $a {}] j [split $b {}] {
        if {$i eq $j} {append res $i} else break
    }
    set res
}

Here is a generalized version that takes any number of strings and returns the prefix which all have in common:
proc longestCommonPrefix strings {
   set res {}
   set i 0
   foreach char [split [lindex $strings 0] {}] {
       foreach string [lrange $strings 1 end] {
           if {[string index $string $i] ne $char} {
               return $res
           }
       }
       append res $char
       incr i
   }
   set res
} ;# RS

MGS 2004-05-13: Here's another way - sort a list of strings and then compare the first and last:
proc string:common {string1 string2} {
    set i 1
    while {[string equal -length $i $string1 $string2]} {incr i}
    return [string range $string1 0 [expr {$i-2}]]
}

proc string:common:list {args} {
    if { [llength $args] < 2 } {
        return -code error "wrong # args: must be >= 2"
    }
    set list [lsort $args]
    return [string:common [lindex $list 0] [lindex $list end]]
}

[JEC] 2010-03-18, notes the above string:common has a bug and never returns on identical strings. Even when fixed it does O(n^2) comparisons

BBH 2004-05-13: This got me thinking that RE back references would be a perfect fit, a little playing & it turns out I was right ;)
proc prefix {s1 s2} {
    regexp {^(.*).*\0\1} "$s1\0$s2" all pref
    return $pref
}

and easily generalized for multiple words
proc prefix {str args} {
    set re {^(.*).*}
    foreach s $args {
        append re {\0\1.*}
        append str "\0$s"
    }
    regexp $re $str all pref
    return $pref
}

NEM 2005-03-17: Another RE variation:
proc reprefix {str1 str2} {
    regexp -inline ^[join [split $str1 {}] ?]? $str2
}

NEM 5mins later. 'Tis broken:
% reprefix iamadonut iamatoilet
iamat

Whoops.

Subscripts  edit

The characters for which subscript versions exist in Unicode (digits, parens, some operators -see the list in code) are converted to their subscripted Unicodes. Others are left unchanged. (For superscripts the positions 1,2,3 seem not to be filled in my installation, so I didn't add the corresponding code...)
proc subscript s {
    set res {} 
    foreach char [split $s {}] {
        set pos [lsearch -exact {0 1 2 3 4 5 6 7 8 9 + - = ( )} $char]
        if {$pos>=0} {set char [format %c [incr pos 0x2080]]}
        append res $char
    }
    set res
} ;# RS
% puts H[subscript 2]O
H?O

Strictly a Floating Point Number  edit

the regular string is double term fires on integers too. The following fires on real floats only:
proc isFloat x {expr {[string is double -strict $x] && ![string is int $x]}} ;#RS

Sort a String  edit

AM: For some symbolic manipulations (w.r.t. group theory, oh just for the fun of it), I need to sort the characters in a string, so that for instance ababa becomes aaabb.

This can be done via:
proc charsort {string}  {
   return [join [lsort [split $string {}]] {}]
}

Longest String  edit

Here's a little proc to find the length of the longest string in a list:
proc maxlen args {
    # Written 2003 by Ed Suominen, hereby placed in the public domain
    if {[llength $args] > 1} {
        if {
            [set x [string length [lindex $args 0]]] -
            [set y [string length [lindex $args 1]]] < 0
        } {
            return -1
        } elseif {$x == $y} {
            return 0
        } else {
            return 1
        }
    } else {
        set x [lsort -decreasing -command maxlen [lindex $args 0]]
        return [string length [lindex $x 0]]
    }
}

Alternative:
proc maxlen args {
    if {[llength $args] == 1} {
        set args [lindex $args 0]
    }
    set res [string length [lindex $args 0]]
    foreach i [lrange $args 1 end] {
        set l2 [string length $i]
        if {$l2 > $res} {set res $l2}
    }
    set res
} ;# RS

KPV one liner using lmap:
set words {now is the time for all good men to come to the aid of their country}
set how_long [::tcl::mathfunc::max {*}[lmap v $words {string length $v}]]
set longest_word [lindex [lsort -index 1 -integer [lmap v $words { list $v [string length $v] }]] end 0]

String Insert  edit

Insert a string into another at a given position, analogous to linsert:
proc strinsert {string pos char} {
    set original [string index $string $pos]
    string replace $string $pos $pos $char$original
} ;# RS
% strinsert hello 1 x
hxello

AMG: Why doesn't this come standard in the string command?

Also, don't let the formal parameter name "char" fool you. With this code, the string to be inserted can be any length. I'd write it this way:
proc strinsert {string pos insertion} {
    append insertion [string index $string $pos]
    string replace $string $pos $pos $insertion
}
namespace ensemble configure string -map\
    [dict replace [namespace ensemble configure string -map] insert strinsert]
% string insert hello 1 xxx
hxxxello

Or, even faster:
proc strinsert {string pos insertion} {
    string replace $string $pos $pos $insertion[string index $string $pos]
}

AMG: See TIP 475 [1] for a new [string insert] command.

N-th Occurrence of a Substring  edit

Silas 2005-10-14: If you want to find the third or fourth occurrence of a string in another string, you'll have to use string first many times. The following proc could help:
proc mystringFirst {substring mystring ocorrencia} {
    if {!$ocorrencia} {return -1}
    set index 0

    for {set i 0} {$i < $ocorrencia} {incr i} {
        set index [string first $substring $mystring]
        set mystring [string range $mystring [expr $index + 1] [string length $mystring]]
    }

    return $index
}

MG 2017-08-17 This doesn't seem to work correctly (or at least gives very unexpected results for me):
% set str "foo bar baz boing sprocket banana"
% mystringFirst ba $str 1
4
% mystringFirst ba $str 2
3

A couple of alternatives that give the results I would expect:
proc mystringFirst2 {needle haystack count} {
        set pos -1
        while { $count && [set pos [string first $needle $haystack $pos+1]] > -1 } {
                incr count -1
        }
        return $pos
}

proc mystringFirst3 {needle haystack count} {
        set indices [regexp -all -inline -indices $needle $haystack]
        if { [llength $indices] >= $count } {
                return [lindex $indices $count-1 0]
        } else {
                return -1
        }
}

mystringFirst3 obviously takes a regexp needle, rather than a plain string, but if you add
  set needle [regsub -all {([^a-zA-Z0-9 ])} $needle {\\\1}]

to the start it'll work on a plain string instead.

String to Proper English Title  edit

D. McC 2005-10-20: The "string totitle" subcommand only, and always, capitalizes the first letter in a string, no matter how many words are in the string or what the words are. Here's some code to convert a multiple-word expression in English to a properly capitalized title, in which all initial letters are capitalized except those of articles, conjunctions, and prepositions with four or fewer letters. (Needs improvement--see revised version, farther below)
proc title str {
    set output {} 
    set nocaps [list a an and at but by for from in into of on or the to with]
    foreach word [split $str] {
        if {[lsearch $nocaps $word] == -1} {
            set word [string totitle $word]
        }
        append output "$word "
    }
    return [string trim $output]
}

Example:
% set bogomips "groundwork of the metaphysics of balderdash"
groundwork of the metaphysics of balderdash
% title $bogomips
Groundwork of the Metaphysics of Balderdash

DKF: Cool! It's not quite right though. The following gets closer, but isn't right yet either (testing on your paragraph above, of course!)
proc title {str} {
    set output {} 
    set nocaps {a an and at but by for from in into of on or the to with}
    foreach word [regexp -all -inline {[\w'.]+|\W+} $str] {
        if {[string match {[A-Za-z]*} $word] && [lsearch $nocaps $word] == -1} {
            # In 8.5 should use the 'ni' operator instead of the [lsearch]
            set word [string totitle $word]
        }
        append output $word
    }
    return [string trim $output]
}

LES: You forget that you'll want those "nocaps" words to be turned to title if they begin a sentence. Here is a proc I use to rename files (usually mp3). It doesn't handle punctuation, but handles hyphens, e.g.: "Artist - A Song Title". Tweaking it to also take punctuation (periods or first word of title) in consideration should be trivial. That is not desired in file renaming because periods (dots) do not mean the same as in regular titles.
proc mp3 args {
    set _nocaps {
        a as à às ao aos de da das do dos e em na nas no nos 
        o os ou para por que sem sob
        an and are at but for from if in is it's not of on or 
        the to under vs vs. with without
        au aux avec dans des en et le la les ou par pour qui si
        con del el en la las los sin y
    }; # German and Italian, anyone?
 
    foreach _file [glob *] { 
        if { $_file eq {.:} } { continue }
        # lowercase the whole name
        set _old  $_file
        set _file [string tolower $_file]
        set _new {}              ;# empty _new in each iteration
        set _c   0               ;# zero counter in each iteration
 
        foreach _word $_file { 
            incr _c
 
            # anything right after " - " probably is the first word of a phrase.
            # if current word is "-", reset counter so next word is 1 and gets caps
            if {$_word eq {-}} {set _c 0}

            # if current word is 1, it gets caps
            if {$_c == 1} {set _word [string totitle $_word]}

            # if it is not in exceptions, it gets caps
            if {[lsearch $_nocaps $k ] < 0} { 
               set k [string totitle $k]
            }

            # add the word to the new name
            lappend _new $k
        }
 
        # UNCOMMENT the two next lines if you want files renamed automatically
        #file rename  $_old  RENAME_TEMP
        #file rename  RENAME_TEMP  $_new
        puts  "$_old\n$_new\n"
    }
}

D. McC: OK, I do want the "nocaps" words capitalized if they (1) begin the title or (2) come right after a colon. Also, as I belatedly noticed, quotation marks need to be stripped out for "string totitle" to work right, but then put back into the finished product. I'd go a long way to avoid a regular expression like {[\w'.]+|\W+}, though. Let's try this:
proc title str {
    set output {} 
    set nocaps [list a an and at but by for from in into of on or the to with]
    set count 0
    foreach word [split $str] {
        # Strip quotation marks:
        if {[string index $word 0] eq "\""} {
            set quote 1
            set word [string trim $word \"]
        } else {
            set quote 0
        }
        # Always capitalize the first word; otherwise,
        # don't capitalize any words in the "nocaps" list:
        if {$count == 0 || [lsearch $nocaps $word] == -1} {
            set word [string totitle $word]
        }
        # Add word plus space, with or without quotation marks, to output:
        if {$quote} {
            append output "\"$word\" "
        } else {
            append output "$word "
        }
        # Capitalize any word after a colon:
        if {[string index $word end] eq {:}} {
            set count 0
        } else {
            incr count
        }
    }
    return [string trim $output]
}

Example:
% set wordsworth {what I say is: by gum, give me the finest "bogomips" in the universe!}
what I say is: by gum, give me the finest "bogomips" in the universe!
% title $wordsworth
What I Say Is: By Gum, Give Me the Finest "Bogomips" in the Universe!
proc mystringFirst {substring mystring ocorrencia} {
    if {!$ocorrencia} {return -1}
    set index 0
  
    for {set i 0} {$i < $ocorrencia} {incr i} {
        set index [string first $substring $mystring]
        set mystring [string range $mystring [expr $index + 1] [string length $mystring]]
    }
  
    return $index
}

pw: Here's an example that doesn't use foreach.
proc title-case text {
    for {set index 0} {$index < [string length $text]} {incr index} {
        if { [string wordstart $text $index] == $index } {
            set text [string replace $text $index $index [string toupper [string index $text $index]]]
        } else {
            set text [string replace $text $index $index [string tolower [string index $text $index]]]
        }
    }
    return $text
}

Misc  edit

LV: Take a look at Tcl-FAQ's part 5 to gather ideas on other packages with string functions.

Letterspacing  edit

escargo:

spreading a string by inserting blanks between each two characters. KBK notes that there's a printer's proverb: Anyone who would l e t t e r s p a c e l o w e r c a s e would steal sheep. Simply functional:
proc letterspace s {join [split $s {}] { }} ;# RS
% letterspace {steal sheep}
s t e a l   s h e e p

In the book Stop Stealing Sheep & find out how type works, by Erik Spiekermann and E.M. Ginger, Adobe Press, 1993, on page 7, in the side bar the quotation attributed to Frederic Goudy was, "Anyone who would letterspace black letter would steal sheep."

Empty string test  edit

See: is_empty

Summary: the fastest way is [expr {![string length $str]].