Updated 2014-04-11 15:49:52 by EMJ

The wiki had some weird trouble with dates and references on Sat 12 Feb 2005, and again on the 13th.

I've made the wiki read-only until the cause of this is resolved. I'll look into it later today. My apologies for the inconvenience.

-jcw, your friendly wiki admin

2005-02-13 12:00 GMT - I've re-enabled the wiki with more patient locking. It looks like some edits are now taking longer than the previous rules for lock acquire/break were prepared to wait for. There may be a race condition in the lockfile logic. If things break again, I'll switch back to read-only mode. Please let me know by email -jcw

Eureka! There is indeed a race condition, when wikit cannot acquire the lock. In wikit/lock.tcl, it does:

  1. check if pid stored in lockfile is a running process
  2. if not, remove the lock file as being stale

But that fails when the old process quits and removes its lock, and a new process then creates its own lock - all between steps 1) and 2). Step 2) then deletes the new, perfectly valid lock.
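The interleaving described above can be replayed deterministically in a few lines of shell. This is only an illustrative sketch, not wikit code: the temporary directory, the lock path, and the fake pid are all made up for the demo, and the three "processes" are simply played out in sequence.

```shell
#!/usr/bin/env bash
# Replay of the check-then-delete race, one step at a time.
# Hypothetical demo: names and the fake pid are assumptions, not wikit's.
dir=$(mktemp -d)
lock="$dir/lock"

echo 99999999 > "$lock"      # lock left by an old process that is about to quit

stale_pid=$(cat "$lock")     # A, step 1: read the pid from the lockfile and
                             # conclude that its owner is no longer running

rm -f "$lock"                # meanwhile: the old process removes its own lock
echo "$$" > "$lock"          # ... and process B creates a fresh, valid lock

rm -f "$lock"                # A, step 2: delete the "stale" lock (really B's)

if [ -e "$lock" ]; then b_lock_survived=yes; else b_lock_survived=no; fi
echo "B's lock survived: $b_lock_survived"

rm -rf "$dir"
```

At the end B still believes it holds the lock, yet the lockfile is gone, so A (or anyone else) can create a new one and open the db as well.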

This is quite probable when the cache has been cleared and a search engine is hitting the wiki, generating tons of CGI calls to rebuild each cache page it requests. As is the case this very moment, btw...

The good news is that no page edits get lost; I can easily rebuild/recover. The problem is just two processes opening the db in r/w mode, which is a no-no with Metakit. Due to some substantial column-wise caching, the effects can be very odd - as the recent two failures have shown.

Yuck. I'll tweak the lock logic a bit to prevent, or at least greatly reduce, this race condition. -jcw

Here's the new locking logic in wikit/lock.tcl:
  proc AcquireLock {lockFile {maxAge 3600}} {
    for {set i 0} {$i < 300} {incr i} {
      catch {
        set t [file mtime $lockFile]
        set fd [open $lockFile]
        set opid [gets $fd]
        close $fd

        if {[clock seconds] > $t + $maxAge ||
            $opid != "" && [file isdir /proc] && ![file isdir /proc/$opid]} {

          # If the lock looks stale, wait a bit to see if it is about to go away
          # and be reclaimed by another process - if so, avoid a file delete race.
          # This caused a damaged db twice in mid-Feb 2005, the new logic should
          # make most wikit instances back off if they see a lock.

          after 2500

          if {[file mtime $lockFile] == $t} {

            # here, the lock needs to be deleted *and* no other process has
            # done so for 2.5s - so we *assume* it can now be deleted without race.

            file delete $lockFile
            set fd [open savelog.txt a]
            set now [clock format [clock seconds]]
            puts $fd "# $now drop lock $opid -> [pid]"
            close $fd
          }
        }
      }
      catch {close $fd}

      if {![catch {open $lockFile {CREAT EXCL WRONLY}} fd]} {
        puts $fd [pid]
        close $fd
        return 1
      }
      after 1100
    }
    return 0
  }
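The part of the proc that actually takes the lock is the open with CREAT EXCL: creating the file and checking that it did not exist happen in one step, so only one process can win. For comparison, here is a rough shell analogue using bash's noclobber option. It is a sketch only: try_lock is a hypothetical helper, not part of wikit, and it leaves out the stale-lock handling and the retry loop.

```shell
#!/usr/bin/env bash
# try_lock FILE: succeed only if FILE did not exist yet (hypothetical helper).
try_lock() {
  # With noclobber set, '>' refuses to overwrite an existing file, so the
  # second caller fails instead of silently taking over the lock.
  # The subshell keeps the option from leaking into the caller.
  ( set -o noclobber; echo "$$" > "$1" ) 2>/dev/null
}

dir=$(mktemp -d)
lock="$dir/lock"

if try_lock "$lock"; then first=acquired; else first=busy; fi
if try_lock "$lock"; then second=acquired; else second=busy; fi
echo "$first $second"

rm -rf "$dir"
```

The first call creates the lockfile and the second one backs off, which is the behaviour the retry loop in AcquireLock relies on.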

More cleanup - it turns out that a large number of requests from search-engine spiders are bogus. I've added a shell filter in front of the CGI calls which rejects all URLs that do not match the following patterns:
  shopt -s extglob
  case "$REQUEST_URI" in
    ?(/tcl)/+([0-9])?(.txt|.html)) ;;
    ?(/tcl)/[^0-9/@!]+([^/@!]))    ;;
    ?(/tcl)/edit/+([0-9])\@)       ;;
    ?(/tcl)/references/+([0-9])\!) ;;
    ?(/tcl)/2\?+([^/@!]))          ;;
    /cgi-bin/nph-wikit/+([0-9]))   ;;
    *)                             cat <<'EOF'
  HTTP/1.0 400 Bad request
  Content-type: text/plain

  This is not a valid URL for this site
  EOF
                                   exit;;
  esac
  echo HTTP/1.0 200 OK

It leads to a 5-fold reduction in requests triggering CGI. Please let me know if I accidentally locked out any valid requests -jcw
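One way to check whether a given URL would be locked out is to wrap the same patterns in a function and feed it a few sample paths. The sketch below does that; url_ok is a hypothetical name and the sample URLs are invented for illustration, not taken from the server logs.

```shell
#!/usr/bin/env bash
shopt -s extglob   # must be enabled before the patterns below are parsed

# url_ok URI: return 0 if URI matches one of the accepted patterns.
# Hypothetical wrapper around the patterns of the filter above.
url_ok() {
  case "$1" in
    ?(/tcl)/+([0-9])?(.txt|.html)) ;;
    ?(/tcl)/[^0-9/@!]+([^/@!]))    ;;
    ?(/tcl)/edit/+([0-9])\@)       ;;
    ?(/tcl)/references/+([0-9])\!) ;;
    ?(/tcl)/2\?+([^/@!]))          ;;
    /cgi-bin/nph-wikit/+([0-9]))   ;;
    *)                             return 1 ;;
  esac
}

accepted=0
rejected=0
# Invented sample URLs: the first four should pass, the last two should not.
for u in '/1234' '/tcl/1234.html' '/tcl/edit/1234@' '/2?foo' \
         '/tcl/../etc/passwd' '/tcl/edit/abc@'; do
  if url_ok "$u"; then
    echo "ok       $u"
    accepted=$((accepted + 1))
  else
    echo "rejected $u"
    rejected=$((rejected + 1))
  fi
done
```

Any URL that should be valid but shows up as rejected here would get the 400 response from the filter.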

14feb05 jcw - Made various tweaks and fixes to /edit/ and /references/ links. The Recent Changes page now omits the link if the page has essentially no content. As far as I can tell, the changes to wikit over the past day or so bring everything back into working order, make URLs stricter, and prevent incorrect requests from triggering a CGI call to wikit (about 80% fewer requests now).

If any further problems show up, please email me [1] -jcw