set features {
    de {ä ö ü ß ß ch ck sch ei en ge ung w}
    en {th y sh ch in ing ed en ns rs w with from and}
    es {á é ión dad es ll ch qu j ya os as üe con ya}
    fr {é é ê è ch ei eu iè oi x y qu au ou de d' et ir te}
    is {á í ú ó ý é þ æ ð ö}
    ga {an á bh bh dh gc iai iai mb mh n- ó uai uai ú}
    it {è è ù da " e " re di il gl io ione ioni cc cch z tt una qu}
    ms {ah ak j jan ke ngan pu se uan ang ber ku}
    nl {ij ij ing tj sj z aa ee ei ou oe oe met ge sch te baar}
    pt {lhã lhõ lha lhe lh nha nhã nho nhõ nhe nt ndo de ão ões ss rr os as ch ça ção ções ico ns apro}
}
# LES gets carried away and suggests:
# pt {" de " " do " " da " " os " " as " " e " " que " " em " " nos " " nas " " na " "de " "te " " se " " às " " aos " " com " lhã lhõ lha lhe lh nha nhã nho nhõ nhe ndo ão õe ãe ssa sse ssi sso ssu rra rre rri rro rru ch ça ção ções ico}
# ''(removed: de nt os as ns apro ss ões)''

The identification code itself is fairly straightforward:

proc lang'identify {string features} {
    set res {}
    foreach {language regexps} $features {
        set score 0
        foreach regexp $regexps {
            incr score [regexp -all -nocase $regexp $string]
        }
        if {$score} {lappend res [list $language $score]}
    }
    # Return {language score} pairs, best match first
    return [lsort -decreasing -integer -index 1 $res]
}

But most important is the testing. I collected sample phrases from food packages, which are often pretty multilingual in Germany, and from other sources. I kept adding sample strings and tuning the above feature set until the target language came first most of the time (short strings can't always be identified correctly).

KPV: If you need examples in various unusual languages, you could use Google's advanced search page and specify different languages to search for.

set ntests 0
set nproblems 0
set score 0
foreach {language string} {
    de {Dies ist ein Beispiel für einen deutschen Satz}
    de {Knusprige Weizenflocken mit Schokoladengeschmack}
    de {Waffeln mit feiner Haselnusscremefüllung}
    de {Vor dem Öffnen bitte schütteln}
    de {Trocken lagern, vor Licht schützen}
    de {Sicherheitsinformationen für Netzkabel und Zubehör}
    de {Auf dieses Produkt wird eine zwölfmonatige Garantie gegeben}
    en {This is an example for an English sentence}
    en {Wafers filled with hazelnut creme}
    en {Store in a dry place, protect from light}
    en {Safety precautions for power cords and accessories}
    en {This product is warranted for the period of twelve months}
    en {Worldwide telephone numbers}
    en {For continuous quality improvement, calls may be monitored}
    es {Él que no espera vencer, ya está vencido}
    es {Copos de trigo tostados con chocolate}
    es {Barquillos rellenos de crema de avellanas}
    es {Precauciones de seguridad para cables de alimentación y accesorios}
    es {Este producto está garantizado por un período de doce meses}
    es {Esta garantía no cubre ninguno de los siguientes casos}
    es {En caso necesario, la lista de nuestros Servicios Autorizados
        está disponible}
    fr {Voilà  un autre exemple pour une phrase francaise}
    fr {Gaufrettes fourrées à la noisette}
    fr {Agiter avant d'ouvrir}
    fr {Conserver au réfrigerateur une fois ouvert et consommer dans les
        jours qui suivent}
    fr {Protéger contre la lumière}
    fr {A consommer de préférence avant fin:}
    fr {Précautions de sécurité concernant les cordons d'alimentation}
    fr {Cet appareil est couvert par une garantie de douze mois}
    ga {Eolas an Chuairteora do Láithreáinin Oidhreachta}
    ga {Tá roinnt bríonna leis an bhfocal Dúchas}
    ga {Tugann na milliúin daione cuairt ar ár n-ionaid gach bliain}
    ga {Tá na hAmanna oscailte sa bhfoilseachán seo i gceart ag am priondála}
    ga {Ionad Cuairteora Pháirc an Fhionnuisce}
    ga {Tógadh an caisleán le linn na mblianta 1870-73}
    ga {Deirtear gur tógadh an foirgneamh is sine anseo 400 bliain ó shin}
    it {Questo è un altro esempio per una frase italiana}
    it {Fiocchi di frumento al cioccolato}
    it {Wafers ripieni di crema alla nocciola}
    it {Agitare prima di aprire}
    it {Da consumare preferibilmente entro il:}
    it {Una volta aperto tenere in frigo e consumare entro qualche giorno}
    it {Precauzioni relative alla sicurezza per i cavi di alimentazione}
    it {Questo prodotto è garantito per un periodo di dodici mesi}
    it {Se non è vero, è ben trovato}
    ms {Pendahuluan cetak yang dibarahui}
    ms {Dia pun hendak ikut saya ke kedai}
    ms {Orang itu pun membuat kerjanya dengan cepat}
    ms {Pukul sembilan setengah malam}
    ms {Jepun pun kalah dalam pertandingan bolasepak Piala Merdeka}
    ms {Meja ini baik, tetapi meja itu pun baik juga}
    ms {Cik Pun pun turut serta dalam pertandingan itu}
    nl {Het is tijd om op te staan, vandaag is het zaterdag}
    nl {Wafeltjes met hazelnootcrèmevulling}
    nl {Tegen licht beschermen. Tenminste houdbaar tot einde:}
    nl {Dit produkt is gegarandeerd voor een periode van twaalf maanden}
    nl {De garantie is alleen geldig wanneer de garantiekaart volledig is ingevuld}
    nl {In probleemgevallen kunt U nadere informatie verkrijgen}
    nl {Reparaties onder garantie moeten door servicecentra worden uitgevoerd}
    pt {Alguns quatrilhões de ítens de informação, formando amostragens de}
    pt {Essas pulsações eram armazenadas em mnemocircuitos idênticos}
    pt {Na praça da cidade, a fila tinha se formado às cinco da manhã com os}
    pt {sentido humorístico. Tinha um fim. Estava armado da lista telefônica.}
    pt {aproximar da Estação Ether e poderei aproveitar a caminhada para}
} {
    incr ntests
    set res [join [lang'identify $string $features]]
    if {[lindex $res 0] ne $language} {
        puts "$string\n   $res **** should have been: $language"
        incr nproblems
    } elseif {[llength $res]==2} {
        incr score [lindex $res 1]
    } elseif {[lindex $res 1] > [lindex $res 3]} {
        incr score [expr {[lindex $res 1]-[lindex $res 3]}]
    } else {
        puts "$string\n   $res **** ambiguous: $language"
        incr nproblems
    }
}
puts "score: [expr {30*$score/$ntests}] Passed:[expr {$ntests-$nproblems}]/$ntests"

GS (040612): Some hints for going further can be found on Gertjan van Noord's web site [1].
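
For a quick look at what the identifier returns for a single string, here is a minimal sketch. It assumes a cut-down feature table (only the de and en entries copied from above) and repeats the scoring loop under the hypothetical name demo'identify so that the snippet runs on its own; neither name is part of the original code.

```tcl
# Reduced feature table: de and en entries copied from the list above
# (illustration only, not a replacement for the full table).
set demo_features {
    de {ä ö ü ß ß ch ck sch ei en ge ung w}
    en {th y sh ch in ing ed en ns rs w with from and}
}

# Same scoring idea as lang'identify, under a hypothetical name:
# count how often each feature regexp occurs, then rank the languages.
proc demo'identify {string features} {
    set res {}
    foreach {language regexps} $features {
        set score 0
        foreach re $regexps {
            incr score [regexp -all -nocase -- $re $string]
        }
        if {$score} {lappend res [list $language $score]}
    }
    return [lsort -decreasing -integer -index 1 $res]
}

set ranking [demo'identify {Vor dem Öffnen bitte schütteln} $demo_features]
puts $ranking

# The best guess is the first pair; an empty list means nothing matched.
if {[llength $ranking]} {
    set best [lindex $ranking 0]
    puts "best guess: [lindex $best 0] ([lindex $best 1] hits)"
}
```

The result is a list of {language score} pairs sorted by descending score, so for this German sample de should come first. Languages with no hits are left out entirely, which is why the test loop above has a separate branch for results with only one pair.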
slebetman: The opening line of this common Malay rhyme fails the test and is identified as English, with a result of {en 2} {nl 1}:

set teststring {dua tiga kucing berlari}

Adding "ber" and "ku" to the list of ms identifiers improves the match to {en 2} {ms 2} {nl 1}. However, the full rhyme passes the original test with a result of {ms 5} {en 4} {ga 4} {nl 2} {it 1}. Here's the full rhyme:

set teststring {
    dua tiga kucin berlari,
    mana nak sama si kucing belang,
    dua tiga boleh ku cari,
    mana nak sama si adik seorang.
}

The addition of "ber" and "ku" improves the result further, to {ms 9}, so I added them to the features list above.

You can get UTF-8 samples of a boatload of languages here: http://unicode.org/udhr/
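
Coming back to the rhyme experiment: the effect of a feature change can be checked in isolation before editing the table. The sketch below assumes a hypothetical helper, count'hits, and two hand-written copies of the ms feature list, one without and one with "ber" and "ku"; it only counts hits for a single language and does not rank languages against each other.

```tcl
# Hypothetical helper: total number of feature hits in a string,
# using the same regexp counting as lang'identify.
proc count'hits {string regexps} {
    set score 0
    foreach re $regexps {
        incr score [regexp -all -nocase -- $re $string]
    }
    return $score
}

# ms features without and with the two additions discussed above
set ms_before {ah ak j jan ke ngan pu se uan ang}
set ms_after  [concat $ms_before {ber ku}]

set line {dua tiga kucing berlari}
puts "before: [count'hits $line $ms_before]"
puts "after:  [count'hits $line $ms_after]"
```

For the opening line this prints 0 before and 2 after, which agrees with ms being absent from the first result and appearing as {ms 2} in the second. Running the same comparison over every sample string of the other languages would show whether a new feature also produces unwanted hits elsewhere.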


