Updated 2017-11-17 20:10:35 by dbohdan

Michael A. Cleverly 2002-11-19: my browser seems unable to copy and paste more than 4,096 characters at a time, which makes copying and pasting large chunks of code from the Wiki difficult. Here is An HTTP robot in Tcl that fetches page(s) from the Wiki and spits out just the stuff between <pre></pre> tags (which, generally speaking, is just Tcl code). It also trims the first leading space that the Wiki requires.

Usage

  1. First you have to get the code from the Code section below into a file somehow. You have to start somewhere ;-) . So save this page into a file called "wiki-reaper" and edit the contents to remove comments, etc.
  2. Make certain that the file will be found when you attempt to run it. On Unix-like systems, that involves putting the file into one of the directories in $PATH.
  3. wiki-reaper 4718 causes wiki-reaper to fetch itself... :-)
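
Going by the usage string in the code below (wiki-reaper ?options? page ?codeBlock? ?revision?), a few illustrative invocations might look like the following; the block and revision numbers are only examples, and code blocks appear to be numbered from zero:

wiki-reaper 4718                        # reap every code block on this page (latest revision)
wiki-reaper 4718 0                      # reap only the first code block
wiki-reaper 4718 0 34                   # reap the first code block as of revision 34
wiki-reaper -x 4718 > wiki-reaper.tcl   # prepend a "#!/usr/bin/env tclsh" line and save to a file
wiki-reaper -v                          # print the version and exit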

Previous Revisions

  • edit 34, replaced 2014-12-30 - the original code by Michael A. Cleverly

Code

#!/usr/bin/env tclsh
package require Tcl 8.5
package require cmdline
package require http
package require textutil

namespace eval wiki-reaper {
    variable version {4.0.0 (2017-11-17)}
    variable protocol https
    variable useCurl 0
    variable hostname wiki.tcl-lang.org

    set curl [expr {![catch {exec curl --version}]}]
    set tls  [expr {![catch {
        package require tls
        http::register https 443 [list ::tls::socket -tls1 1]
    }]}]
    if {!$tls} {
        if {$curl} {
            set useCurl 1
        } else {
            set protocol http
        }
    }
    unset curl tls
}

proc ::wiki-reaper::output args {
    set ch stdout
    switch -exact -- [llength $args] {
        1 { lassign $args data }
        2 { lassign $args ch data}
        default {
            error {wrong # args: should be "output ?channelId? string"}
        }
    }
    # Don't throw an error if $ch is closed halfway through.
    catch { puts $ch $data }
}

proc ::wiki-reaper::fetch url {
    variable useCurl
    # The cookie is necessary when you want to retrieve page history.
    set cookie wikit_e=wiki-reaper
    if {$useCurl} {
        set data [exec curl -s -f -b $cookie $url]
    } else {
        set connection [::http::geturl $url -headers [list Cookie $cookie]]
        set data [::http::data $connection]
        ::http::cleanup $connection
    }
    return $data
}

proc ::wiki-reaper::parse-history-page {html pattern} {
    set needle [join [list \
        {<td class=["']Rev["']><a href=["'][^"']+["'] rel=["']nofollow["']>} \
        $pattern \
        {</a></td><td class=["']Date["']>([0-9\-\: ]+)</td>} \
    ] {}]
    set success [regexp -nocase $needle $html _ revision date]
    if {!$success} {
        error "couldn't parse the revision or the date."
    }
    return [list $revision $date]
}

proc ::wiki-reaper::reap {page block {revision "latest"} {flags ""}} {
    variable protocol
    variable hostname

    set latestHistoryUrl "$protocol://${hostname}/_/history?N=\$page&S=0&L=1"
    set revisionHistoryUrl "protocol://${hostname}/_/history?N=\$page&S=\[expr \
            {\$latestRevision - \$revision}\]&L=1"
    set pageUrl "$protocol://${hostname}/_/revision?N=\$page&V=\$revision"
    set codeUrl "$protocol://${hostname}/_/revision?N=\$page.code&V=\$revision"
    set now [clock format [clock seconds] -format "%Y-%m-%d %H:%M:%S" -gmt 1]

    if {$revision eq ""} {
        set revision "latest"
    }

    set latestHistory [fetch [subst $latestHistoryUrl]]

    if {![regexp -nocase \
            {<title>Change history of ([^<]*)</title>} \
            $latestHistory _ title]} {
        error "couldn't parse the document title."
    }
    lassign [parse-history-page $latestHistory {([0-9]+)}] \
            latestRevision latestUpdated

    if {$revision eq "latest"} {
        set revision $latestRevision
        set updated $latestUpdated
    } else {
        if {![string is integer -strict $revision] ||
                ($revision < 0) ||
                ($revision > $latestRevision)} {
            error "no revision $revision ($latestRevision latest)"
        }
        set revHistory [fetch [subst $revisionHistoryUrl]]
        lassign [parse-history-page $revHistory "($revision)"] revision updated
    }

    set code [fetch [subst $codeUrl]]

    if {[regexp -nocase "<body>.?<h2>[subst $codeUrl] Not Found</h2>" \
            $code _]} {
        error "wiki page $page does not exist"
    }

    if {$block ne ""} {
        set codeBlocks [::textutil::splitx $code \
                {\n#+ <code_block id=[0-9]+ title='[^>]*?'> #*\n}]
        set code [lindex $codeBlocks [expr {$block + 1}]]
    }

    if {[dict get $flags "x"]} {
        output "#! /usr/bin/env tclsh"
    }
    output "#####"
    output "#"
    output "# \"$title\" ([subst $pageUrl])"
    if {$block ne ""} {
        output "# Code block $block"
    }
    output "#"
    output "# Wiki page revision $revision, updated: $updated GMT"
    output "# Tcl code harvested on: $now GMT"
    output "#"
    output "#####"
    output $code
    output "# EOF"
}

proc ::wiki-reaper::main {argv} {
    variable protocol
    variable version

    set options {
        {f "Allow downloading over HTTP instead of HTTPS"}
        {x "Output '#!/usr/bin/env tclsh' as the first line"}
        {v "Print version and exit"}
    }
    set usage "?options? page ?codeBlock? ?revision?"
    if {$argv in {-h -help --help -?}} {
        output stderr [::cmdline::usage $options $usage]
        exit 0
    }
    if {[catch {
        set flags [::cmdline::getoptions argv $options $usage]
    } err]} {
        output stderr $err
        exit 1
    }
    if {$protocol eq "http"} {
        if {[dict get $flags "f"]} {
            output stderr {Warning! Can't use cURL or TclTLS; connecting over\
                           insecure HTTP.}
        } else {
            output stderr {Can't use cURL or TclTLS; refusing to connect over\
                           insecure HTTP without the "-f" flag.}
            exit 1
        }
    }
    lassign $argv page block revision

    if {[dict get $flags "v"]} {
        output $version
        exit 0
    }
    if {$page eq ""} {
        output stderr [::cmdline::usage $options $usage]
        exit 0
    }

    reap $page $block $revision $flags
}

proc ::wiki-reaper::main-script? {} {
    # From https://tcl.wiki/40097.
    global argv0
    if {[info exists argv0]
     && [file exists [info script]] && [file exists $argv0]} {
        file stat $argv0        argv0Info
        file stat [info script] scriptInfo
        expr {$argv0Info(dev) == $scriptInfo(dev)
           && $argv0Info(ino) == $scriptInfo(ino)}
    } else {
        return 0
    }
}

if {[::wiki-reaper::main-script?]} {
    ::wiki-reaper::main $argv
}
#EOF

Bootstrapping

You wouldn't want to copy-paste this to file manually, now would you?

fr: Why not? lam suggested a tiny piece of JavaScript that allows copy and paste (click on the code to select it). Please check out this page cloned at wiki-reaper

On *nix

Here's a one-liner to save the wiki-reaper script to wiki-reaper.tcl in the current directory. It uses awk and a special marker in the code to make sure other code blocks on this wiki page don't interfere.

curl https://wiki.tcl-lang.org/4718.code | awk 'BEGIN{write=0} /env tclsh/{if(write==0){write=1}} /#EOF/{write=-1} {if(write==1){print $0}}' > wiki-reaper.tcl && chmod 0755 wiki-reaper.tcl

If you want to install wiki-reaper to /usr/local/bin on your machine instead, run

curl https://wiki.tcl-lang.org/4718.code | awk 'BEGIN{write=0} /env tclsh/{if(write==0){write=1}} /#EOF/{write=-1} {if(write==1){print $0}}' | sudo tee /usr/local/bin/wiki-reaper && sudo chmod +x /usr/local/bin/wiki-reaper

On Windows

TBD

Security

If it can't use TclTLS or find cURL on your system, wiki-reaper falls back (when given the -f flag) to downloading code from the wiki over plain HTTP, which is vulnerable to a [MITM] attack, meaning a hostile network node can replace the code you want with something malicious. Moreover, anyone can edit the wiki, so the code may change between when you look at it and when you download it. Be sure to inspect the code you fetch with wiki-reaper before you run it.
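
For example, one cautious workflow is to save what you reap to a file (the file name here is arbitrary), read it, and only then run it:

wiki-reaper -x 4718 0 > reaped.tcl   # reap a code block into a file
less reaped.tcl                      # inspect it before trusting it
tclsh reaped.tcl                     # run it only once you are satisfied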

Discussion

jcw 2002-11-22:

This could be the start of something more, maybe...

I've been thinking about how to make the wiki work for securely re-usable snippets of script code. Right now, doing a copy-and-paste is tedious (the above solves that), but also risky: what if someone decides to play tricks and hide some nasty change in there? That prospect is enough to make it quite tricky to re-use any substantial pieces, other than after careful review - or simply as inspiration for re-writing things.

Can we do better? Maybe we could. What if a "wiki snippet repository" were added to this site - here's a bit of thinking-out-loud:

  • if verbatim text (text in <pre>...</pre> form) starts off with a certain marker, it gets recognized as being a "snippet"
  • snippets are stored in a separate read-only area, and remain forever accessible, even if the page changes subsequently
  • the main trick is that snippets get stored on basis of their MD5 sum
  • each snippet also includes: the wiki page#, the IP of the submitter, timestamp, and a tag
  • the tag is extracted from the special marker that introduces a snippet; it's a "name" for the snippet, to help deal with multiple snippets on a page

Now where does this all lead? Well, it's rough thinking, but here are a couple of comments about it:

  • if you have an MD5, you can retrieve a snippet, without risk of it being tampered with, by a URL, say http://mini.net/wikisnippet/<this-is-the-32-character-md5-in-hex>
  • the IP stored with it is the IP of the person making the change, and creating the snippet in the first place, so it is a reliable indicator of the source of the snippet
  • if you edit a page and don't touch snippet contents, nothing happens to them
  • if you do alter one, it gets a new MD5 and other info, and gets stored as a new snippet
  • if you delete one, it stops being on the page, but the old one is retrievable as before

Does this mean all authentication is solved? No. It becomes a separate issue to manage snippet-md5's, but what the author needs to do is pick a way to get that to others. I could imagine that in basic use, authors maintain an annotated list of snippets on their page - but frankly, this becomes a matter of key management. How do you tell someone that you have something for them if the channel cannot be trusted? Use a different channel: chat, email, a secured (https/ssl) website, whatever.

This approach would not solve everything. But what it would do is ensure that *if* I have a snippet reference, and I trust it, then I can get the contents at any time. Snippets would have the same property as wiki pages in that they never break, but with the added property that they never change.

On top of that, who knows... a catalog? With user logins, authentication, pgp, anything can be done. A tool to grab the snippet given its md5, and a tool to locate snippets based on simple tag or content searches, it's all pretty trivial to do.
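
For instance, here is a minimal sketch of such a grab-and-verify tool in Tcl. It is purely illustrative: it assumes the tcllib md5 package, the proc name is made up, and the snippet URL scheme is the hypothetical one sketched above.

package require http
package require md5

# Hypothetical: fetch a snippet by its 32-character hex MD5 and refuse to
# return it unless the content actually hashes to that value.
proc fetch-snippet {md5hex} {
    set tok [::http::geturl http://mini.net/wikisnippet/$md5hex]
    set data [::http::data $tok]
    ::http::cleanup $tok
    if {![string equal -nocase [::md5::md5 -hex $data] $md5hex]} {
        error "snippet does not match its MD5; do not use it"
    }
    return $data
}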

Is this a worthwhile avenue to explore further? You tell me... :o)

SB 2002-11-23: If you for a minute forget about the validation of code integrity and think about the possibility of modifying program code independent of location, then it sounds like a very good idea. An example is to show the progress of coding. The start is a very simple piece of example code; then the example is slightly modified to show how the program can be improved. With this scheme, every improvement of the code can be backtracked to the very beginning and, hence, work as a tutorial for new programmers. If we then think about trust again, though, there are too many options for code fraud that I do not know about.

escargo 2002-11-23: I have to point out that the IP address of the source is subject to a bunch of qualifications. Leaving out the possibility of the IP address being spoofed, I get different IP addresses because of the different locations I use to connect to the wiki; with subnet masking it's entirely possible that my IP addresses could look very different at different times even when I am connected from the same system.

Aside from that issue, could such a scheme be made to work well with a version of the unknown proc and mounting the wiki, or part of the wiki, through VFS? This gets back to the TIP dealing with metadata for a repository.

This in turn leads me to wonder, how much of a change would it be to add a page template capability to the wiki? In practice now, when we create a new page, it is always the same kind of page. What if there were a policy change that allowed each new page to be created from a specific set of page types? The new snippet page would be one of those types. Each new page would have metadata associated with it. Instead of always editing pages in a text box, maybe there would be a generated form. Is that possible? How hard would it be? This could lead from a pure wiki to a web-based application, but I don't know if that is a bad thing or not. Just a thought. (Tidied up 5 May 2003 by escargo.)

LV 2003-05-05: with regards to the snippet ideas above, I wonder if, with the addition of CSS support here on the wiki, some sort of specialized marking would not only enable snipping code, but would also enable some sort of special display as well - perhaps color coding to distinguish proc names from variables from data from comments, etc.

CJU 2004-03-07: In order to do that, you would need to add quite a bit of extra markup to the HTML. I once saw somewhere that one of the unwritten "rules" of wikit development was that preformatted text should always be rendered untouched from the original wiki source (with the exception of links for URLs). I don't particularly agree with it, but as long as it's there, I'm not inclined to believe that the developer(s) are willing to change.

Now, straying away from your comment for a bit, I would rather have each preformatted text block contain a link to the plaintext within that block. This reaping is an entertaining exercise, but it's really just a work-around for the fact that getting just the code out of an HTML page is inconvenient for some people. I came to this conclusion when I saw a person suggest that all reapable pages on the wiki should have hidden markup so that the reaper could recognize whether the page was reapable or not. To me, it's a big red flag when you're talking about manually editing hundreds or thousands of pages to get capability that should be more or less automatic.

I'm looking at toying around with wikit in the near future, so I'll add this to my list of planned hacks.

LV 2007-10-08:

Well, I changed the mini.net reference to tcl.wiki. But there is a bug that results in punctuation being encoded. I don't know why that wasn't a problem before. But I changed one string map into a call to ::htmlparse::mapEscapes to take care of the problem.

tb 2009-06-16

Hm... I still get unmapped escape sequences when reaping from this page using kbskit-8.6. I don't get them when reaping from a running wikit. Am I missing something?

LV 2009-06-17 07:37:08:

Is anyone still using this program? Do any of the wiki's enhancements from the past year or two provide a way to make this type of program easier?

jdc 2009-06-17 08:34:23:

Fetching <pagenumber>.txt will get you the Wiki markup. Best start from there when you want to parse the wiki pages yourself. Another option is to fetch <pagenumber>.code to only get the code blocks. Or use TWiG.
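
As a hedged illustration of that suggestion, assuming the <pagenumber>.txt and <pagenumber>.code URLs resolve on the current wiki host, and reusing the TclTLS registration from the script above:

package require http
package require tls
::http::register https 443 [list ::tls::socket -tls1 1]

# Fetch the raw wiki markup of this page; swap ".txt" for ".code" to get
# only the code blocks.
set tok [::http::geturl https://wiki.tcl-lang.org/4718.txt]
puts [::http::data $tok]
::http::cleanup $tok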

dbohdan 2014-12-30: The code could not retrieve anything when I tried it, so I updated it to work with the wiki as it is today. It now uses <pagenumber>.code, which jdc has mentioned above. Other changes:

  • Do not use nstcl-http or exec magic.
  • Include page URL in the output.
  • Format date of retrieval consistently with dates on the wiki.
  • Can retrieve a specific code block from the page if you tell it to. The usage is now wiki-reaper page ?codeBlock?.
  • Can no longer fetch more than one page at a time due to the above.

Note that I have replaced the original code with my update. If someone wants to preserve the original code on the page I can include mine separately.

PYK 2014-12-30: It's nice to see older code on the wiki get some maintenance. I like the idea of a Previous Revisions section like the one I added above to record bigger changes.

dbohdan 2014-12-30: I like it as well. The page looks much nicer in general after your last edit.

I'm thinking of outputting the current revision number of the page being reaped along with the time when it was last updated. Is there any way to get the current revision number for a page without having to log in? Faking a log in to access page history seems excessive for such a little script.

dbohdan 2014-12-30: Never mind--it's a simple matter of a plain-text cookie. I've added it in the new version of the script.
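
For reference, a minimal, purely illustrative sketch of that cookie trick, reusing the wikit_e cookie and the history URL from the script above (per the comment in wiki-reaper::fetch, the cookie is what makes page history retrievable without a real login):

package require http
package require tls
::http::register https 443 [list ::tls::socket -tls1 1]

# Request the change history of page 4718 while presenting a wikit_e cookie,
# the same way wiki-reaper::fetch does; the latest revision number and date
# can then be parsed out of the returned HTML.
set tok [::http::geturl https://wiki.tcl-lang.org/_/history?N=4718&S=0&L=1 \
        -headers [list Cookie wikit_e=wiki-reaper]]
puts [::http::data $tok]
::http::cleanup $tok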

dbohdan 2015-01-21: Version 2.2.0 includes revision support, i.e., you can get the content of any code block on a page at any given revision number. Edit: updated to save the wiki some traffic.

EF 2016-08-01: Version 2.2.1 includes a new hostname variable pointing to the Web server at which to find the wiki, as tcl.wiki was not working anymore.

stevel tcl.wiki is a backup in case the .tk domain gets hijacked again. It redirects to wiki.tcl.tk.

dbohdan 2017-08-14: wiki-reaper broke when the wiki switched from single to double quotes in the HTML markup on the history page. Version 2.6.2 fixes that.

See also

  • wiki-runner
  • TWiG
  • Fetch <page URL>.txt to get the wiki markup instead of the HTML, or <page URL>.code for just the code blocks.