Updated 2011-12-22 16:12:32 by dkf

Page started by Theo Verelst

Apart from having been called a fruitcake once (har har, what a drag), I find most things going on here serious, so maybe the weight of this page can be carried by something with as light a name as a 'wiki', considering it is about a programming language which to some degree is also better off with this consideration (read also: Verelst is pissed about a lot of things, and thinks he can affect the world by content-wise discussions, even when they are rants).

Seriously. Long ago, when computers first started to get a physical place in the world, memories, communication rates, and computational resources were small. Yet networked, useful, multitasking machines, with a serious price tag of course, were made which could do most things modern computers can do: multiprocess, communicate over various channels, run databases, run OSes with powerful commands, be entertaining, etc.

Through progress coming almost solely from advances in electrical engineering, discs got faster and bigger, as did processors, networks and main memories.

Then computers started to get more popular outside the heavy business and research areas, probably because chip and board manufacturers, and the design community, found ways to get the price down, and because technology evolved to make them do fun enough things, especially once some graphics were added to the machines.

When graphics started to get into the picture, memory and a lot of other things got strained. I remember vividly the advent of X on some (at the time university) computers, and the quadratic growth of required frame buffer memory as a function of the image size. That happens to have come from the same subject Tcl was related to: chip design, where large screens were useful to represent the content of a chip.

A bit later, personal computing became different from, let's say, Apples and TRS-80s and derived architectures based on the same microprocessors (or quality improvements such as the Beeb) when the processors, memory and boards got into the 'mega' ballgame: several megahertz of clock frequency, almost a megapixel on the screen, megabyte-per-second networks, and several tens of megabytes of hard disc.

No MPEG yet, but more fun: nice images, graphical user interfaces, and a lot more powerful OS than on most micros.

And a lot of programming: to get all those menus, pictures, dialog boxes, windows and application interfaces up and running from about a meg of core memory (early PCs, the Atari ST, and of course the classic Mac and later ones).

More professionally speaking, quickly forgetting GKS and such, one had Unix, VMS, IBM stuff (I guess; I didn't use it), and what has interestingly been named X, with a lot of libraries to handle large frame buffers of all variations, lightweight windows, dot-, line- and rectangle-based drawing, blitting, window handlers and more.

And suddenly: neat little 'hello world'-like programs taking no more than maybe a few tens of kilobytes of stdio library (assuming it is statically linked), while one programming-complexity step higher things started to have a ground level of a megabyte just for simple executables. Which is not funny with an OS based on little programs for many separate tasks (call them reentrant, parallel, communicating objects), so disc and memory sizes became a major issue.

And a lot of programming was concerned with making good use of resources, such as screen redraw algorithms, which make up a lot of application programming unless you have a good way to shield that off.

At least on the systems running that graphics 'shell', memory management was mostly to some degree advanced: dynamic allocation was OS-supported, and the hardware would support paging methods. Still, showing 5 big pictures on a screen forces you to store the image buffer data for all of them somewhere at times, unless you can 'construct' the content programmatically, or unless you can assume they will never all be completely visible at the same time.

On the PC side of things, 'window redraw' was one of the complex issues to program.

Currently, main memories are usually big enough to store a number of full-size frame buffers, which potentially makes all that a lot easier.

I was recently playing with the graphical programming tool bwise, simple but useful in-core database routines, and tclhttpd, combined in a single wish84 running on XP, about which I'll do a page [Web server with bwise graphical programming and database] when I find the opportunity. That setup, made from, what is it, maybe a few hundred kilobytes of combined scripts, registered about 8 megabytes in the task manager window.

I do have to mention that bwise can take pictures on its canvas, and the database has the capability to display records which refer to pictures, which I used to comment on a number of recent digital photographs. I made the pictures automatically smaller to fit in the database presentation window, and then smaller again to be shown as a little block on the bwise canvas, but the original size would be about a megapixel in high-quality JPEG (half a megabyte), which has to get read in before the resizing (at least the way I did it now).

I just checked: the executables of wish84 are very small, and the basic DLLs for Tcl and Tk together are about 1.8 MB. Starting wish or tkcon makes the system (XP SP1 in this case; I'll try Linux, too) report about 5 megabytes of memory use.

I think it would be a lot better if some memory could be given back to the system; I remember having used 16 megabytes at some point, especially after putting an image of about 1000 pixels square on the canvas. Not that that is necessarily problematic, but it is a good illustration of not very sparing memory policies in action.

DKF 26-8-2003 - Some versions of the memory allocator do hand memory back to the OS. It's not trivial to do in general though, since the OS only allocates in terms of whole memory pages, not individual bytes. Some time I must find the time to make the general version of the Tcl_Obj hand-back allocator work properly...

JMN 2003-08-26 From what I can tell, there seems to be an attitude in the Tcl community with regard to Tcl's apparent retention of memory that it doesn't really matter - or that it's too hard to give the memory back to the OS, and besides, other scripting languages such as Perl have the same sorts of issues... or something like that. Maybe these really are valid reasons and it is the OS's fault more than Tcl's - but all I know is that when I run lots of Tcl scripts there comes a point where applications crash and misbehave, and it's all very frustrating because I know all that memory can't really be in use, and I wish the whole 'high-watermark' style of holding onto memory would just disappear. I vaguely recall seeing something about there being the possibility of choosing different memory allocators for use with Tcl - but as someone who doesn't delve into the C side of Tcl I wouldn't know how useful this is. Then again - right now my browser alone appears to be using over 195MB of RAM - so nowadays I couldn't feel comfortable just picking on Tcl as a RAM hog - it's apparently just the way of things.

Ok.. so while typing the above, DKF has disproved my sweeping statements about the 'attitude' of the Tcl community to the situation ;) Good to see that at least someone cares.. even if improvements in the area are a long way off.

DKF 26-8-2003: A bit more about why we don't hand memory back to the OS. The key thing is that it is hard to implement well due to things like fragmentation. Determining whether a memory block is free (you really don't want to deallocate something still in use) is quite expensive too. Given that for most apps (not all, I'll accept), the max amount of memory used is fairly closely related to the average, it's not a big enough win given how complex it is to debug and tune.

I also think that there is a libc that comes with this sort of support built-in. I think it might be the glibc used on recent Linuxes, but I couldn't swear to it. I suspect that Windows can also do that sort of thing for you, but the last time this discussion happened on tcl-core I didn't pay much attention.

Perhaps all you need to do is analyze your own apps for memory leaks (scripts can leak even if the core doesn't!) and switch to a suitable version of the core and a suitable libc/OS. We're quite careful with core memory leaks, and Tcl is much better than Tk in this particular area.

TV Without getting into the whole discussion above at this point, what I remember from older, good enough Unix systems with respect to memory management is that you'd have malloc and free, of course: the one gives a chunk of memory which is guaranteed to be contiguous, and free gives back that whole chunk to the OS to reuse for whatever; I mean, after free it is open to a multi-GB/s blit blast of rewriting for all the OS cares. The idea of the contiguousness of course being that that is easy to program on. The use would be to make sure you get a big enough chunk to do sensible processing on without running the machine into direct virtual memory problems, so you'd preferably allocate a significant but not too large fraction of the main memory, which you are sure you are going to make good use of.

The paging mechanisms on such machines, completely unlike the early PC-like machines, can in principle allow the system to fragment the whole memory, for instance into pages or into segments, which can then be mapped to actual physical memory pages by the pager, in principle in any order and permutation. So, for instance, on a page basis you'd have complete freedom to swap in any pages you want anywhere in the actual memory banks, without the programmer knowing it (unless something goes wrong), though you may want to take programming habits into account (a page adjacent to the current one in the programmer's page numbering will most likely be needed soon). Hardware units would do page owner management (out-of-process detection) and virtual page handling, so when one accesses various pages not all too often, not much speed penalty is paid.

Malloc-ing more often makes the OS give virtual memory ranges anywhere in the theoretical range, with no foreknowable place in physical memory, though probably at malloc time the involved pages wouldn't immediately start being swapped out to disc.

So as long as all programs together don't use more than physical memory, nothing gets swapped or slow, and paging with reordering can be transparent to the programmer, so that he doesn't need to think in a limited heap/stack and/or relocating way.

Completely unlike a simplistic model I programmed in, too, where you'd have a process with heap and stack growing in opposing directions, and malloc somewhere in the free space, limited in size to whatever memory fragmentation dictates. Then at startup one (the biggest program first) takes a big piece of linear memory, and hopes for the best.

I know the Pentium has page/segment handling capabilities with somewhat variable page sizes, though I must admit I don't recall/know exactly what it can do, apart from the basic process memory space protection at least used in Linux.

Let alone that I'm aware of what the various not(mega)not(hard) OS versions do in these departments, except that I do know decent program startup planning makes a lot of difference in process swap times, and that Linux, too, eats memory and doesn't seem to return much of it for the various screen management programs I've worked with. And for instance Red Hat 8 doesn't exactly remind me of a one-floppy professional OS in that department. Then again, a heavy-price-tag mini didn't support all too many high-quality X servers either.

I remember doing an extra malloc layer for *small* memory chunks, to prevent the per-block allocation bookkeeping from being larger than, or nearly as large as, the memory block itself. That is also relevant on Unix.
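As an illustration of that kind of layer (a minimal sketch, not the code I used back then; the pool size, chunk alignment and the restriction that chunks are only freed all at once are arbitrary example choices): grab one larger block from malloc and hand out small chunks from it by bumping a pointer, so only the big block pays the allocator's per-block overhead.

   #include <stdlib.h>
   #include <stddef.h>

   /* Minimal bump sub-allocator sketch: one malloc'd slab, many small chunks.
      Chunks are not individually freeable; the whole slab is freed at once. */
   #define POOL_SIZE 65536              /* arbitrary slab size for the example */

   static char  *pool      = NULL;      /* the one big block from the real malloc */
   static size_t pool_used = 0;

   void *small_alloc(size_t n)
   {
       n = (n + 7) & ~(size_t)7;        /* round up to 8-byte alignment */
       if (pool == NULL) {
           pool = malloc(POOL_SIZE);    /* single call to the real allocator */
           if (pool == NULL) return NULL;
       }
       if (pool_used + n > POOL_SIZE) return NULL;   /* slab exhausted */
       void *p = pool + pool_used;
       pool_used += n;
       return p;
   }

   void small_free_all(void)            /* give the whole slab back at once */
   {
       free(pool);
       pool = NULL;
       pool_used = 0;
   }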

DKF: Some libc implementations have calls to allow for tuning of malloc in this way. We don't bother in Tcl because the cost of maintaining platform-specific hacks (these things tend to vary even between UNIXes) is rather high; if someone wants to squeeze things for a particular deployment, they can always embed Tcl in an app that sets those things up.

elfring 26 Aug 2003 Can any links to [memory allocation] libraries help? [1]

Will the library "Hoard Memory Allocator" [2] be used for Tcl in the near future?

DKF: Not by default, though you can link it in yourself. It fails in three ways: 1) LGPL license, 2) written in C++, 3) does not even approach the style guidelines. The first problem is not necessarily a show-stopper and certainly lets you use it as a replacement malloc(), but the second and third problems are major and cannot be resolved in short order without significant effort. Do you want to do a comparative analysis with the memory allocator that was contributed to Tcl from the AOLserver people?

Is the following software in use to improve TCL code quality?

  1. http://www.garret.ru/~knizhnik/jlint/ReadMe.htm#introduction (DKF: Mostly a Java tool. The code that handles C is not likely to be useful if the Tcl style-guide is followed.)
  2. http://www.splint.org/, http://sourceforge.net/projects/splint/
  3. http://www.kessler.de/prd/gimpel/p_pclint.htm (DKF: Unable to assess without translation DE->EN.)
  4. http://www.lambdacs.com/debugger/USENIX/Debugger_USENIX_2003.html (DKF: Technique unlikely to scale to significant programs.)
  5. http://debugger.netbeans.org/ (DKF: Java debugger. Does not appear to be relevant)
  6. http://www.eclipse.org/ (DKF: Which bit did you have in mind?)
  7. http://www.gnu.org/search/fsd-search.py?q=debug (DKF: Please indicate which of the search results are relevant.)
  8. http://sourceforge.net/tracker/?func=detail&aid=667010&group_id=1&atid=350001 (DKF: That page is a rehash of your link list which you have posted here.)

DKF: A general suggestion. Given how many links above are unspecific or just plain irrelevant, why not thin them out a bit so that they are properly meaningful? Better yet, why not post some links to tools that will help with memory debugging? None of the links above (with the possible exception of the ones to pclint - which I can't assess - and Eclipse - which is an unhelpful wall of links) are relevant to the topic of this page (as opposed to general debugging.)

TV Some interesting links, though I had in mind the general subject, which is not so much debugging or 'black box' [analysis]; as always it is better to know how the whole game works, that is, how the average and other types of computers deal with memory ([allocation], [process binding] and [limit checking], [virtual addressing] and [page mapping], [page size] options and [overhead]) and what the various [operating systems] which support Tcl/Tk have to offer, or limit one with.

[Lint] is an analyzer from the Unix era, which should be fine; the Hoard package is for [distributed computing], which is not explicitly supported in Tcl, though bare sockets get you there fine. The [gnu debugger] (the other day I found out it now also lives in Cygwin under the name 'insight'), which has a Tk UI, is a very powerful tool in the hands of the serious developer. It allows assembly-level debugging (lower one usually cannot go; microprogrammability, though not a bad idea for these issues, is out since the death of some PDPs that could, and probably some exceptions), mixed with C source code, and stack tracing of nested function calls with the common Unix debugger-like tracing and breakpointing possibilities. It does deal with multitasking, and I've at least used interrupt-driven sound programs with it, and it allows remote (socket-based) operation, so you can unburden the machine running the program under test from it, leaving it largely unaffected by the test (debug) program, apart from some TCP/IP networking overhead.

I'm sure I did not primarily bring up a quality problem, though that could be part of it; it's just straight thinking: I load a program, have so many libraries which I need and normally therefore must load in core, read so much data, which expands to a certain amount of in-core data, and then that little sum often seems way on the unnecessarily high side, and at times ridiculously so. When things start running, it goes even further, because then the dynamics kick in, meaning memory, even when it is not used stupidly, gets fragmented, leaving a lot of unusable pieces in main memory so that effectively things clog up. Which at least doesn't happen much under Tcl in my experience.

Obviously, certain branches of software/informatics land are driven, or led astray if you like, by all kinds of not even secondary considerations, and it is often just too apparent they don't really know what they are talking about, yet they ask for a lot of air time, attention, and empathy, and, if you're unlucky enough to work for them, sucking up, without good content-wise reason.

DKF: It's probably easier to talk specific cases. There's relatively little flab in the core, but lots of caching. You can eliminate many of those caches (and tune down the default sizes of a number of other structures) but you won't necessarily gain, especially when you factor in time costs as well.

elfring 27 Aug 2003 I assume that the package "TclTest" is heavily used. A lot of [test cases] are automatically executed and checked. But I think that the usual developer [experience] is that some [leaks] are left and new [problems] can creep into the software.

The search for [programming errors] need not be done manually all the time. - Let tools do the job for us. I pointed to some [free tools] like the [lint] variants that perform [code analysis] and [flaw detection]. I hope that TCL [implementations] will be available for that in the future. Java classes can be called by the package "TclJava".

I asked the SourceForge administration to add them to the debugging toolset.

  • Are you sure that the scanner [nessus] [3] does not detect [holes] in TCL [servers]? ;-)
  • How much can we control the [memory management strategies] of the diverse libraries? It is very hard.

DKF: You seem to be under the impression that a tool can do a better job of finding flaws than a person. This is bullshit. A tool may apply a simplistic rule to loads of stuff, but it does not provide an easier way of determining whether the code is actually correct unless the specification of the code is utterly nailed down. But nobody doing such work uses Tcl, Java or C (except as a stage inside a compiler for some other language.) Those tools also cost a fortune because the only people who really need them are the safety-critical-software-systems people, and they have the budget. They also have the training in writing formal mathematical specifications and in specification refinement and retrenchment. This is a very specialized area.

In the meantime, no tool can check the spec against the code. Professional coverage and memory analysis tools help (I've nothing against free tools per se, but if they don't let me track down new problems they're a waste of my time), but they don't work well with sophisticated cases (which are fairly common in the Tcl core, which has been heavily worked on for years). And then you have to worry about whether the spec itself is correct. It is known that the test suite is incomplete and in some cases arguably wrong. But what tool could generate a correct case? Generating a case that the current implementation says is correct is pretty easy to do, but ass-backwards.

And what has this got to do with the subject of the page?

BTW, Nessus is irrelevant to this discussion. It tests applications. Not scripting languages. (Its current attacks detect 3 flaws related to Tcl-based things, none of which are inherent in the language but instead in the way that the language was used.) STAY ON TOPIC!

TV The whole idea of using Tcl, and in another sense Tk, is to NOT have to deal much with the memory allocation problem, which should not have to lead to large processes and much memory use by Tcl and Tk.

The malloc game, for those unfamiliar, goes as follows:
   #include <stdlib.h>    /* declares malloc() with the right prototype */

   char *basepointer;
   int i;

   basepointer = (char *) malloc(1000);   /* ask for 1000 contiguous bytes */

   for (i = 0; i < 1000; i++)
      *(basepointer + i) = (char) 0;      /* zero every byte of the block */

In C code, that is. It can also appear in assembly code, and other languages, but the name and its widespread use to my knowledge come from C/Unix, and it is fundamental in Unix and Linux, and I guess basically the same in Windows, though I'm not intimately aware of that. I like using the Cygwin Unix/POSIX abstraction layer for ease of use...

The first line makes sure the malloc function is declared with the right argument and return types; then a 'pointer' to characters (bytes) is declared with the name basepointer, which we set to the return value of the malloc (memory allocate) function.

After it has returned, the malloc function has either returned a NULL pointer, meaning memory didn't get allocated, or a pointer to the lowest memory address of a contiguous piece of memory, guaranteed to be of at least the requested size.
   memory
   0                                   basepointer             basepointer+1000                                          end of memory
   -----------------------------------------------------------------------------------------------------------------------
   |                                   |         m1            |                                                         |
   -----------------------------------------------------------------------------------------------------------------------

The basepointer contains an address between zero and the number of bytes in the (virtual) memory minus 1000, and points to the 'base' of the allocated piece of memory m1.
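As a small aside to the above (not part of the original walk-through): the NULL case should be checked before the block is touched. A minimal sketch, continuing the fragment above; the error handling style is of course a matter of taste:

   basepointer = (char *) malloc(1000);
   if (basepointer == NULL) {                /* NULL means the request failed */
      fprintf(stderr, "out of memory\n");    /* fprintf() needs <stdio.h>     */
      exit(1);                               /* exit() is in <stdlib.h>       */
   }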

The operating system, which is involved in the malloc function (it usually doesn't exist as a fully independent library function), remembers in some of its internal tables that it gave the piece of memory m1 to the process which did the malloc call, which of course also costs some memory. The computer hardware, possibly the Pentium chip, a RISC processor or whatever else deals with memory protection in your computer (in very old ones that is nothing), must also be informed that when your process receives processor time (from the time slicer which gives us the nice semi-parallel tasks on all modern OSes) it has been given access to our m1 memory chunk, and usually all other processes must be denied access, and generate the familiar 'segmentation violation' when they would access m1. All this must also take into account that virtual addressing doesn't necessarily treat the whole main memory, and especially the virtual memory, as one linear list of addresses, although in practice that is the most convenient for the usual setup of a core memory of chips with memory banks, and a threefold virtual extension by swapping pieces of memory to and from a quickly reachable piece of the hard disc.

The function free can be used to give back the piece of memory to be reused at will by the operating system:
   free(basepointer);    /* hand the whole 1000-byte block back for reuse */

After this, the memory location pointed to by basepointer and the subsequent 999 bytes may not be read from or written to anymore; doing so can generate an error on machines with memory protection. We could also repeatedly allocate pieces of memory, by repeatedly calling malloc:
   char *p1[n], *p2[n];                /* arrays to remember the 2*n pointers */
   int i;
   for (i = 0; i < n; i++) {
      p1[i] = (char *) malloc(10000);  /* 2*n separate allocations in total */
      p2[i] = (char *) malloc(1000);
   }

Doing it in this fashion would usually be a programming logic error: when the pieces are all created at the same time anyway, why waste the overhead of creating them, bookkeeping them and calling malloc many times? We could alternatively do:
   char *p = (char *) malloc(n*10000 + n*1000);   /* one big contiguous block */
   for (i = 0; i < n; i++) {
      p1[i] = p + (10000+1000)*i;                 /* carve out the big pieces  */
      p2[i] = p + (10000+1000)*i + 10000;         /* ...and the small ones     */
   }

The advantage of using the fragmented approach is that we can feed back pieces of memory to the pool of free memory the OS keeps as soon as we are done using one or more of them.

Now assume, on a freshly started machine (ahem, this depends on the machine not having perfect memory address translation/permutation, i.e. a limited page/segment register size, which is probably a justified assumption), that we are the first big memory-spending process to run, and that we called one of the above and have chosen n such that we fill up the major part of the core memory.

Now we free, for instance, the smaller chunks, and want to allocate a few pieces of a few thousand bytes each, and find that either we cannot, because memory is considered full, or portions of memory start getting written to the swap file first.
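In code, continuing the fragment above (the sizes are only illustrative):

   for (i = 0; i < n; i++) {
      free(p2[i]);                      /* give back only the 1000-byte pieces */
      p2[i] = NULL;
   }
   /* The freed pieces typically leave n isolated gaps of about 1000 bytes,
      separated by the still-allocated 10000-byte blocks.  A request like the
      one below needs 3000 contiguous bytes, which none of those gaps can
      provide, so the allocator has to grow the heap instead of reusing the
      freed space. */
   char *q = (char *) malloc(3000);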

It can be worse: suppose we 'free' the large chunks instead, and want to have n/2 chunks of size 12000. It could be that we cannot, or that the total recorded memory in use for the process effectively doubles.

Paging and segmenting make things more complicated: not just the just-mentioned 'gaps' become important for a subsequent malloc of a bigger piece of memory, but also the rounding to page boundaries.

Various solutions or trouble-prevention schemes exist. Mainly, an application can rely on using malloc, but should keep in mind that it preferably reuses its already allocated memory before freeing pieces, and that it allocates memory in reasonably big pieces.

On some machines, I suspect at least older Windows ones, it is not a bad idea to claim sufficient memory at startup for the lifetime of the program, and do your own internal memory management. Circular reuse of buffer space, for instance.
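A minimal sketch of that idea (the slot count and slot size are arbitrary example values): claim one block up front and cycle through it, so no further malloc/free traffic happens while the program runs. Callers must of course be done with a slot before the ring wraps around to it again.

   #include <stdlib.h>

   #define SLOTS     64                 /* arbitrary: number of reusable buffers */
   #define SLOT_SIZE 4096               /* arbitrary: bytes per buffer           */

   static char *arena;                  /* one allocation for the whole lifetime */
   static int   next_slot = 0;

   int ring_init(void)
   {
      arena = malloc((size_t)SLOTS * SLOT_SIZE);
      return arena != NULL;
   }

   /* Hand out the next buffer in the ring; old contents are simply overwritten
      once the ring wraps around, so callers must be done with a slot by then. */
   char *ring_next(void)
   {
      char *p = arena + (size_t)next_slot * SLOT_SIZE;
      next_slot = (next_slot + 1) % SLOTS;
      return p;
   }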

For applications like Tcl and Tk, which can easily grow with time and in principle also shrink, reasonable memory management and decent malloc use (not too fine-grained, not invoked too often) should be good enough.

In Tk, as in any graphics program, memory use is at least partly governed by dealing with pixel/bit maps.

Earlier graphics libraries would make graphical elements which would be stored as lines, rectangles and circles of a certain colour, and the pixelmap of the computer screen would be written to on the basis of lists of such elements at the time of display.

Currently, when one uses a double-buffered, true-colour-with-alpha-channel OpenGL or similar screen with hundreds of thousands of interpolated-colour polygons and various large texture images, dealing with graphics-related memory is different.

Our list-of-simple-elements method would, on machines like the Atari ST I programmed, lead to what are called 'redraw' routines, which would quite literally 'redraw' portions of the screen that need updating, for instance because of moving a window or closing a dialog box.

This could be done without any intermediate pixel level storage.

Suppose we have n windows of average size a*a, and let a be a significant part of the visible screen, let's assume 1000*1000 true-colour pixels (3 bytes/pixel, meaning 3 megabytes of frame buffer). If we take 5 windows with a = 500, we would need total pixel buffer space for all windows, stored separately, of 5 * 500 * 500 * 3 bytes = 3.75 million bytes.

That is a reason for a lot of applications needing lots of memory.

And maybe some programmers like to make use of even more buffers for a piece of screen memory: an unscrolled full-size graphics window, uncompressed images used in the windows, full-size partial pixel maps, etc. Not to mention intermediate blitting pixel maps.

For 3D, you may have voxel (3D space partitioning), fog, light, shadow, reflection and mirror maps, Z buffer memory, of course the scene geometry and graphical element attribute data.

Naturally, Tk can work pretty much without intermediate pixmap storage: most of its elements can be drawn directly, and since they are limited in kind and reused by many graphical elements, their redraw can be made intelligent, so that it takes little storage to make complex interfaces appear on the screen, at the expense of redraw processor demands and possible delay, though even a 75 MHz Pentium with Windows 3.1 can work fine.

PT: I think it would be useful to read [4] which discusses kernel memory management. In short, on unix free doesn't return memory to the OS but releases it back to the heap. The brk and sbrk system calls are used to handle the size of the heap. In general, malloc is conservative. The assumption is that if a process once needed a certain amount of memory, then it is likely to need it again.

On Windows, an application may manage multiple heaps. Visual Studio is an example of an application which does this. The principal problem I see is for server-type tcl applications which are long lived processes. For these cases it would be useful to have a separate heap per interpreter which could then be released to the OS when the interp is deleted. Anything else seems unlikely to be worth the development effort.

TV Interesting reading material. What it does not discuss, however, is how virtual memory management actually works; I mean, the heap story is interesting at the process API level, but how does that translate to the machine: in the case of Linux a PC architecture, and in the case of Unix probably a more advanced workstation/mini architecture. The Mac I suspect of having not so strong a multitasking infrastructure on the somewhat older machines, though at least the pre-PowerPC Macs had a decent 68000-based processor (with linear memory by nature).

How does a virtual page get mapped to a physical one, and, related, how does the pointer malloc returns relate to what the Pentium uses as an address?

The problems I described above still very much hold. If you malloc certain pieces of memory on the heap, or wherever they go, and you free some afterward, you'll end up with fragmented memory, which at some point doesn't allow you to continue without dealing with the fact that the sum of all memory chunks you officially have is less than the address space being taken up, and this worsens as fragmentation continues. Paging makes things such as alignment happen at the page level, depending on how much you malloc exactly. Alignment issues also make you use more actual memory space than the sum of all officially allocated pieces.

With extensive paging, you can achieve a very large effective memory space, but more than that, you can 'shuffle' the virtual memory of, for instance, the contiguous basic process memory space into actual physical memory and (also virtual) swap memory. This preferably requires low-level memory paging support, and requires a memory address translation table, which necessarily gets larger for smaller or more numerous memory pages, and in principle incurs processing (speed penalty) overhead. When pages can be grouped into segments, both can be reduced.

I remember from the Pentium datasheets that it has various kinds of segments, I think also with varying segment and page sizes and options, but with horrible address translation compatibility issues.

DKF: There's little point in worrying about physical memory if you're not writing an OS or a hardware driver (especially a hardware driver that uses DMA to shift bytes) since all machines (except for some embedded stuff) these days operate using virtual memory which hides all the details. VM is handled in specialized hardware, and it typically works by effectively rewriting every pointer (on a page-by-page basis) that the processor issues when reading from or writing to memory (triggering an exception/trap when no mapping is held, which is handled by mapping the page in, allocating a new page, throwing a signal, etc.) The details vary according to the hardware platform, but should be ignored at the Tcl level. All you need to know really is that the OS allocates memory to processes in chunks called pages (whose size can be looked up) and with the right syscalls, you can return those pages back to the OS's pool of unused pages.

TV Tell that to Windows. Or Linux. The schemes I described above unfortunately can very well hold, so one does have to take into account that often-repeated malloc/free calls can create memory usage patterns which effectively span a lot more pages than the sum of the officially allocated memory. And in spite of virtual (process- or process-group-based) memory possibly being large, that will eat up main memory, too, if all those pages, with their little remainders of officially allocated memory pieces, are referenced.

Everybody knows about hard disc defragmentation. Memory is different, but unless you have a perfect, sub-page-level reallocating machine and OS, you will run into such issues, and unfortunately they do have to be dealt with at times, also at the application level.

Also, I'm sure I remember that the Pentium handles memory protection, paging and address translation at least partly in a segmented way, where every segment requires a descriptor; the details have escaped me for the moment.

I know very well that the order and way in which memory-eating programs are started, including for instance a big database in Tcl (which I find quite an interesting subject, because lists and Tcl memory descriptors can hide paging and virtual memory and be unaffected by them) or web browsers and all kinds of applications, makes a major difference on a variety of machines and OSes in terms of when they require or give themselves a new startup...

DKF: Ignore physical memory. Seriously. (I take it you're not hacking the low-level guts of an OS or a device driver that requires DMA or memory-mapped devices; they're the only time you need to understand physical memory.) When you're working with Tcl, you're working in the domain of virtual memory. (The way VM<->PM mapping works is interesting, but varies between architectures a lot and is thankfully something that OSes abstract away from for you.)

Instead, let us look at how memory allocators work. The OS allocator hands you memory by the page; that's what the OS works with. Generally, it can do so in one of two ways: extending an existing block of pages (this is what happens with sbrk() IIRC) or creating a new page block (which you do on Unix usually by mmap()ing either /dev/zero or a null filehandle IIRC.) However, a page is a rather large thing to allocate for many uses, so your C library puts another layer on top of that to cope with small chunks, and it is that layer that you access through malloc() and friends. The main thing to note here is that there is a cost associated with using this layer, in that it is not that fast (mallocs make a good unit of costing in performance evaluations of a complex algorithm) and there is a space overhead per allocation (I see 8 bytes on many systems, presumably enough for some magic and the length of the allocated block.)
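A rough sketch of that page-level layer on a Unix-like system (using an anonymous mmap() rather than /dev/zero; MAP_ANONYMOUS is assumed to be available, and error handling is kept minimal):

   #include <sys/mman.h>
   #include <unistd.h>
   #include <stdio.h>

   int main(void)
   {
      long pagesize = sysconf(_SC_PAGESIZE);   /* the unit the OS allocates in */

      /* Ask the OS for 16 whole pages of zeroed, private memory. */
      size_t len = 16 * (size_t)pagesize;
      void *block = mmap(NULL, len, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
      if (block == MAP_FAILED) {
         perror("mmap");
         return 1;
      }
      printf("got %zu bytes at %p (page size %ld)\n", len, block, pagesize);

      /* Returning the pages to the OS's pool is explicit and page-granular. */
      munmap(block, len);
      return 0;
   }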

The real problem with many programs (and especially long-lived ones) is that they end up fragmenting memory so that they have lots of gaps between allocated memory blocks which are just slightly too small for the memory blocks that you want to allocate. This can be a real PITA to deal with! The most common way to fix these sorts of things is to use an aggregating allocator so that all memory blocks of a certain size are allocated together. Tcl uses this strategy normally for allocating Tcl_Obj structures, and it can greatly enhance the overall efficiency. (You can push things a lot harder of course, but then it gets harder to port to other architectures and the returns drop off too. It's usually only worthwhile doing this for structures which you allocate a lot of.)
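To illustrate the aggregating idea in plain C (a generic sketch, not Tcl's actual Tcl_Obj allocator; the object size and slab size are arbitrary example values): keep all same-sized objects together in slabs and recycle them through a free list.

   #include <stdlib.h>

   /* Generic fixed-size object pool: all objects of one size live together,
      and freed objects go onto a free list for immediate reuse. */
   typedef union slot { union slot *next; char payload[32]; } Slot;  /* 32 = example object size */

   #define SLAB_OBJECTS 256             /* how many objects each slab holds */

   static Slot *free_list = NULL;

   void *pool_alloc(void)
   {
      if (free_list == NULL) {
         /* One malloc() serves SLAB_OBJECTS future allocations. */
         Slot *slab = malloc(SLAB_OBJECTS * sizeof(Slot));
         if (slab == NULL) return NULL;
         for (int i = 0; i < SLAB_OBJECTS; i++) {   /* thread the slab into the list */
            slab[i].next = free_list;
            free_list = &slab[i];
         }
      }
      Slot *s = free_list;
      free_list = s->next;
      return s;
   }

   void pool_free(void *p)              /* objects return to the list, not the OS */
   {
      Slot *s = p;
      s->next = free_list;
      free_list = s;
   }

Note that this sketch never hands slabs back to the OS; once a slab has been allocated it stays on the free list, which is essentially the high-water-mark behaviour described next.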

I should note here that Tcl (in common with classic versions of malloc()) uses a high-water-mark memory allocation strategy for Tcl_Obj allocation, as this means that the cost of throwing away an old object and getting a new one is almost zero. This works quite well for many modes of use, and especially those where the program has a well-defined limited lifespan. It's not so good for long-running server applications though. In those cases, you can have spikes in memory usage where you need lots for a short time but don't wish to hang onto it for ages (assuming that the overall virtual memory pool is not exhausted and that not all such apps have major requirements at the same time, this is a pretty good strategy.) Some versions of malloc() (ISTR that the glibc version might be one) already do this, and I wrote an incomplete patch for Tcl to make the allocation of objects use this style while still preserving the pooling behaviour which is known to be very good for performance.

The other thing to note here is that fragmentation can also come about when you're allocating chunks that slowly increase in size (e.g. when you're appending to a list or a string). Tcl handles these cases fairly efficiently, using an exponential growth strategy that minimizes the number of allocations and also the degree of fragmentation. We're also careful to use realloc() quite a bit, which allows the C library a chance to minimize fragmentation when it can.
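A sketch of that growth pattern in plain C (the doubling factor and the use of realloc() mirror the strategy described above, but the exact growth policy Tcl uses may differ; start from a zero-initialized GrowBuf):

   #include <stdlib.h>
   #include <string.h>

   typedef struct {
      char   *data;
      size_t  used;
      size_t  capacity;
   } GrowBuf;

   /* Append n bytes, doubling capacity when needed so that the number of
      (re)allocations grows only logarithmically with the final size. */
   int growbuf_append(GrowBuf *b, const char *src, size_t n)
   {
      if (b->used + n > b->capacity) {
         size_t newcap = b->capacity ? b->capacity : 64;
         while (newcap < b->used + n) {
            newcap *= 2;                       /* exponential growth */
         }
         char *p = realloc(b->data, newcap);   /* may be able to grow in place */
         if (p == NULL) return 0;
         b->data = p;
         b->capacity = newcap;
      }
      memcpy(b->data + b->used, src, n);
      b->used += n;
      return 1;
   }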

MSH 06-Sep-2004 One small note for Windows users: I use Win 2000 and have noticed that my programs do use a lot of memory, but that when I iconify the program the memory usage drops dramatically (as with all other Windows programs); then when I resize the program the memory creeps back up, but nowhere near as much as before. Maybe the memory use reported by the task manager includes memory required by the GDI windows interface? I find in general after this exercise that my programs are not that memory hungry.

DKF: The task manager 'Memory Usage' column tells you how much physical memory is used for the process, together with memory that is really just memory-mapped files (e.g. binaries) and as such not really part of the process. It's not really a very useful measure because it depends not only on how much virtual memory is being used but also on the memory access pattern and the priority of the process. As you noted, this means things change around all over the place (Win treats iconifying as an opportunity to reduce the prio of a process). Try the 'Virtual Memory Size' column instead; it's not turned on by default but you can select it.

See also: Compact Data Storage.