- Read the data into one or more lists
- Extract the data for a much shorter period of time
- Determine the minimum and maximum values
- Make a quick and dirty plot
- Write some data in a suitable format to another file
- ...
- I want to plot the data for station A and station B in the same plot. This could be done by code like:
    set plot [create-a-plot]
    $plot add data_a
    $plot add data_b
- I want to determine daily averages of the temperature:
    set averages [average [group $temp 24]]
- I want to compute the total amount of nitrogen (present in the chemical forms "Kjeldahl-N" and "nitrate" in the water body):
    set totn [sum $kjdn $nitrate]
- A more ambitious form: all the data are stored in the same table containing ammonia, Kjeldahl-N, nitrate, ortho-phosphate and total-P, some typical water quality parameters:
    set total_nutrients [construct $table totn {$kdn+$nitrate} totp {$totp}]
- Give a statistical summary of the data in the table:
    describe $table
- Select a certain range of data (retain every 10th record):
    set new_table [slice $table 0 1000 10]

Ambitious? Perhaps.

The main problem is not the structure in which to keep the data; the main problem is the collection of commands! I can think of at least four different underlying structures:
- As a list or a list of lists - simple and straightforward
- As a matrix (from the Tcllib matrix module) - provides a flexible API for manipulating rows and columns
- As binary arrays - compact, easy to use by a compiled extension (vkit actually uses this approach)
- As tables in a database - Metakit could serve very well here
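
The daily-averages wish can be sketched on the simplest of these structures, a plain list. The commands [group] and [average] are only named in the wish list above; the definitions here are hypothetical and assume that all values are numeric and that none are missing:

```tcl
# Hypothetical helper: split a list into consecutive groups of "size" values
# (the last group may be shorter)
proc group {values size} {
    if { $size <= 0 } {
        return -code error "Group size must be positive"
    }
    set groups {}
    set n [llength $values]
    for {set i 0} {$i < $n} {incr i $size} {
        lappend groups [lrange $values $i [expr {$i + $size - 1}]]
    }
    return $groups
}

# Hypothetical helper: return the mean value of each group
proc average {groups} {
    set result {}
    foreach g $groups {
        set sum 0.0
        foreach v $g {
            set sum [expr {$sum + $v}]
        }
        lappend result [expr {$sum / [llength $g]}]
    }
    return $result
}

# Two "days" of three values each, instead of 24 hourly values
set temp {10 12 14 18 20 22}
puts [average [group $temp 3]]   ;# prints: 12.0 20.0
```

With hourly data the call would be [average [group $temp 24]], exactly as in the wish list.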
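
Whatever the structure, a first set of commands could look like this (the implementation below uses a list of lists):
- [dataset {list of names}] creates an empty table with columns that can be addressed by name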
- [sum $data1 $data2] returns a new dataset whose columns consist of the sums of the two datasets (the second may also be a plain list). The number of columns and rows must match
- [filter $data1 {condition}] returns a new dataset of which the rows have been filtered through the condition
- [slice $data1 $start $stop ?step?] returns a new dataset which has only the rows that are requested
- [contents $data] returns summary information about the dataset (printable)
- [setnames data {list of names}] sets the names of the columns of a dataset
- [getnames $data] returns the names of the columns of a dataset
- [getrow $data $row] returns a list with the values at the given row
- [addrow data $values] adds a new row of data at the end
- [print $data] prints the contents of the dataset to screen
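
Determining the minimum and maximum of a column, one of the tasks listed at the start, is not covered by these commands. A minimal sketch of how it might be added, assuming a dataset is a two-element list of column names and rows (as in the code that follows); [getcolumn] and [minmax] are hypothetical names, not part of the command set above:

```tcl
# Hypothetical command: return the values of one column as a plain list
proc getcolumn {dataset name} {
    set idx [lsearch -exact [lindex $dataset 0] $name]
    if { $idx < 0 } {
        return -code error "Unknown column: $name"
    }
    set column {}
    foreach row [lindex $dataset 1] {
        lappend column [lindex $row $idx]
    }
    return $column
}

# Hypothetical command: return the minimum and maximum of a list of numbers
proc minmax {values} {
    set sorted [lsort -real $values]
    return [list [lindex $sorted 0] [lindex $sorted end]]
}

# A dataset with columns A and B and three rows
set table {{A B} {{1 0} {2 2} {3 4}}}
puts [minmax [getcolumn $table B]]   ;# prints: 0 4
```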
}
# Create a proper namespace for the data manipulations and
# declare the public commands
#
namespace eval ::dml {
    namespace export dataset sum filter slice contents setnames
    namespace export getnames getrow addrow print

    # Private namespace - for filter and others
    namespace eval v {}
}

# dataset --
#     Create a new empty data set
# Arguments:
#     names       List of names
# Result:
#     A new dataset
#
proc ::dml::dataset {names} {
    return [list $names {}]
}

# contents --
#     Return a readable description of the dataset
# Arguments:
#     dataset     The dataset in question
# Result:
#     String describing the dataset
#
proc ::dml::contents {dataset} {
    set string "Columns in the dataset: [join [getnames $dataset] ", "]\n\
        Number of rows: [llength [lindex $dataset 1]]"
    return $string
}

# getnames --
#     Get the column names of an existing dataset
# Arguments:
#     dataset     The dataset to be examined
# Result:
#     List of column names
#
proc ::dml::getnames {dataset} {
    return [lindex $dataset 0]
}

# setnames --
#     Set the column names of an existing dataset
# Arguments:
#     dataset     Name of the variable holding the dataset
#     newnames    The new names for the columns - number must match
# Result:
#     None
#
proc ::dml::setnames {dataset newnames} {
    upvar $dataset theset
    set names [lindex $theset 0]
    if { [llength $names] != [llength $newnames] } {
        return -code error "Number of names does not match the number of columns"
    }
    lset theset 0 $newnames
}

# addrow --
#     Add a row of data to an existing dataset
# Arguments:
#     dataset     Name of the dataset
#     values      The row to be added
# Result:
#     None
# Note:
#     The number of values must match the number of columns
#
proc ::dml::addrow {dataset values} {
    upvar $dataset theset
    set names [lindex $theset 0]
    if { [llength $names] != [llength $values] } {
        return -code error "Number of values does not match the number of columns"
    }
    set data [lindex $theset 1]
    lappend data $values
    lset theset 1 $data
}

# getrow --
#     Get a row of data from an existing dataset
# Arguments:
#     dataset     The dataset to be examined
#     row         The index of the row to be returned
# Result:
#     List with the values at the given row
#     (empty if the row does not exist)
#
proc ::dml::getrow {dataset row} {
    return [lindex [lindex $dataset 1] $row]
}

# slice --
#     Select the rows of a new dataset by stepping through an existing one
# Arguments:
#     dataset     The dataset to take the rows from
#     start       The index of the first row to be added
#     stop        The index of the last row to be added
#     step        The step size for stepping through the rows (optional)
# Result:
#     New dataset
#
proc ::dml::slice {dataset start stop {step 1}} {
    set names [lindex $dataset 0]
    set data  [lindex $dataset 1]
    if { $step <= 0 } {
        return -code error "Step size must be positive"
    }
    set newset [dataset $names]
    set row    $start
    while { $row <= $stop } {
        addrow newset [getrow $dataset $row]
        incr row $step
    }
    return $newset
}

# sum --
#     Sum two datasets column by column
# Arguments:
#     data1       The first dataset
#     data2       The second dataset or a plain list
# Result:
#     New dataset
#
proc ::dml::sum {data1 data2} {
    #
    # Determine the number of columns first
    # Tricky, though
    #
    set names1  [lindex $data1 0]
    set values1 [lindex $data1 1]
    set norows1 [llength $values1]
    if { [llength $data2] != 2 } {
        set norows2 1
        set names2  $data2
        set values2 $data2
    } else {
        #
        # Note: this logic is _not_ complete!
        # It fails for a proper dataset with 1 column and 1 row
        #
        if { [llength [lindex $data2 1]] == 1 } {
            set norows2 1
            set names2  $data2
            set values2 $data2
        } else {
            set names2  [lindex $data2 0]
            set values2 [lindex $data2 1]
            set norows2 [llength $values2]
        }
    }
    if { $norows1 != $norows2 && $norows2 != 1 } {
        return -code error "Numbers of rows do not match"
    }
    if { [llength $names1] != [llength $names2] } {
        return -code error "Numbers of columns do not match"
    }
    set newset [dataset $names1]
    if { $norows2 > 1 } {
        foreach row1 $values1 row2 $values2 {
            set newrow {}
            foreach c1 $row1 c2 $row2 {
                lappend newrow [expr {$c1+$c2}]
            }
            addrow newset $newrow
        }
    } else {
        foreach row1 $values1 {
            set newrow {}
            foreach c1 $row1 c2 $values2 {
                lappend newrow [expr {$c1+$c2}]
            }
            addrow newset $newrow
        }
    }
    return $newset
}

# filter --
#     Filter the rows of a dataset
# Arguments:
#     dataset     The dataset to be filtered
#     condition   The condition for keeping a row
# Result:
#     New dataset
# Note:
#     To avoid possible conflicts between local
#     variables and the column names, we use a
#     private namespace for the local variables
#
proc ::dml::filter {dataset condition} {
    set v::names  [lindex $dataset 0]
    set v::newset [dataset $v::names]
    set v::values [lindex $dataset 1]
    set v::cond   $condition
    foreach v::row $v::values {
        # Set one local variable per column for this row
        foreach $v::names $v::row {break}
        # The condition is deliberately not braced: it is substituted
        # first and then evaluated as an expression by [if]
        if $v::cond {
            addrow v::newset $v::row
        }
    }
    return $v::newset
}

# print --
#     Print the contents of a dataset to stdout
# Arguments:
#     dataset     The dataset to be printed
# Result:
#     None
# Side effect:
#     Contents shown on screen
#
proc ::dml::print {dataset} {
    set row 0
    puts "Columns: [join [getnames $dataset] \t]"
    while {1} {
        set values [getrow $dataset $row]
        if { [llength $values] == 0 } {
            break
        }
        puts "$row: [join $values \t]"
        incr row
    }
}

if {0} {
Let us test this code:
}
namespace import ::dml::*
set table [dataset {A B}]
#
# Hm, candidate for a new command?
#
for {set i 0} {$i < 10} {incr i} {
    addrow table [list [expr {$i+1}] [expr {2*$i}]]
}
puts "Contents: [contents $table]"
puts "Columns: [getnames $table]"
puts "Records with A > 2"
set newtable [filter $table {$A>2}]
print $newtable
puts "Records with A > 2 and B < 10"
set newtable [filter $table {$A>2 && $B<10}]
print $newtable
puts "Summed table:"
print [sum $table $table]
puts "Slice of a table:"
print [slice $table 2 4]

if {0} {
A few remarks about the above code are appropriate:
- There is far less error checking than needed in a truly useful package, especially if it is to be used in a semi-interactive way.
- The above code ignores the possibility of missing values, most naturally represented as an empty string or list ("" or {}).
- The command [filter] does not deal with parameters (that is, a condition like {$A>$threshold}, where "threshold" is a local variable). One naive way of dealing with such conditions would be to have the user use a global or namespace variable instead, like: {$A>$::threshold}.
- With the current set of commands you cannot manipulate columns.
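
The remark about parameters in [filter] could be addressed without global variables by passing the parameters explicitly. The sketch below is one possible approach, not part of the code above: [filterwith] is a hypothetical command that takes an extra name/value list and evaluates the condition in an anonymous procedure whose arguments are the column names followed by the parameter names. It assumes Tcl 8.5 or later (for [apply] and {*}) and the same two-element list representation of a dataset:

```tcl
# Hypothetical variant of [filter]: "params" is a list of name/value
# pairs that can be used in the condition next to the column names
proc filterwith {dataset condition {params {}}} {
    set names   [lindex $dataset 0]
    set pnames  {}
    set pvalues {}
    foreach {name value} $params {
        lappend pnames  $name
        lappend pvalues $value
    }

    # Anonymous procedure: arguments are the columns and the parameters,
    # the body evaluates the condition as an expression
    set lambda [list [concat $names $pnames] [list expr $condition]]

    set rows {}
    foreach row [lindex $dataset 1] {
        if { [apply $lambda {*}$row {*}$pvalues] } {
            lappend rows $row
        }
    }
    return [list $names $rows]
}

# A dataset with columns A and B and four rows
set table {{A B} {{1 0} {2 2} {3 4} {4 6}}}
set threshold 2
puts [filterwith $table {$A > $threshold} [list threshold $threshold]]
# prints: {A B} {{3 4} {4 6}}
```

Because the condition runs inside its own procedure, its variables cannot clash with the local variables of the filtering command, so the private namespace is not needed in this variant. A parameter that has the same name as a column would still be a conflict.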

