

Performance & Lessons Learned

Date: Wed, 17 Apr 2002 17:49:20 GMT
From: Gene Michael Stover <gms@sdf.lonestar.org>
Subject: Garbage collection performance

Hi All,

I'll be making a new release of Giggle soon (today, maybe).
Here are the results of some performance comparisons I did.
Veeeeeeeeeery interesting... even for people who don't
program in C++.

If you don't have the energy to read the whole thing (it's
long), take a peek at the results table, then skip to the
final section ("ONE MORE NOTE ABOUT THE TESTS") where I
relate a small case study about programmer productivity &
garbage collection; the case study occurred while I was
writing the speed-test programs.

I hope y'all enjoy it.

--------

Giggle... well basically, Giggle rocks, & I'm a genius.  I
now use "composition" to create garbage collectors.  Some of
the composition is done at run-time & some at compile-time.
To the C++ programmer, that means I use templates &
... well, composition (since there's no C++-specific term
for run-time composition).

I'll tell more about what I did momentarily, but first, here
are the results from a speed test.  Sorted from fastest to
slowest (& taking only the top 6 fastest performers) we have:

iterations  time          rate        program  technique
 458157788   300    1527192.62  bin/speed0012  raw Boehm
 205190888   303     677197.65  bin/speed0011  malloc/free
 201504008   303     665029.72  bin/speed0010  new/delete
  30771956   303     101557.61  bin/speed1012  CyberTiggyr::giggle::ReferenceCounterT< CyberTiggyr::giggle::Bookkeeping, CyberTiggyr::giggle::OffsetBookkeepingMap<CyberTiggyr::giggle::Bookkeeping>, CyberTiggyr::giggle::RecyclingAllocator<char> >
  29982470   310      96717.64  bin/speed1014  CyberTiggyr::giggle::ReferenceCounterT< CyberTiggyr::giggle::Bookkeeping, CyberTiggyr::giggle::OffsetBookkeepingMap<CyberTiggyr::giggle::Bookkeeping>, CyberTiggyr::giggle::MallocAllocator<char> >
  27869853   309      90193.69  bin/speed1002  CyberTiggyr::giggle::ReferenceCounter

Notice that the top two performers are not specific to C++.
(Yes, they _did_ call constructors & destructors.)  (The
Boehm collector is a small C library.  Search Google for
"boehm collector", & you'll find it.  I use it for C all
time nowadays.)

WHAT THE TEST IS

To understand the results, you must understand the test.

Each test is a separate program.  A driver program
(speed.sh) runs each one & times it from launch to finish.
(I originally had the speed-test programs time themselves,
but some garbage collection techniques (Boehm) do their work
after 'main' exits, so I needed to time the process as a
whole, including the time it takes the program to exit.)

A test is a big loop.  I chose it to represent a common,
non-trivial type of memory use an app might do, but to be
simple enough for me to implement as a speed-test.  Here's
what the loop does (pseudocode):

- A Node is a basic linked-list node:
  struct Node { Node *next; };

- Make an array of Node*, call it 'v'.  (In the speed-test
  programs, it's actually a std::vector<NodePtr>.)

- Until time runs out (5 minutes):

  - Let i be a randomly selected index in v.
  - Make a new Node.
  - Roll a die:

    If die == 1, forget about the linked list that is v[i]
    (garbage collector might collect it),

    Otherwise, Node.next = v[i] (so we're extending the
    linked list that is v[i]).

  - v[i] = Node.  (So we're putting the new list in v[i].
    If die == 1, then we're putting an entirely new list of
    length 1 there, & we'll probably grow it in later
    iterations.  If die != 1, then we've extended the list
    that was already there.)

- Return the number of times we made it through the loop.

So the test makes a bunch of seaweed (linked lists) that
grow (one Node at a time), & sometimes a stalk of seaweed is
chopped off (forgotten so the garbage collector can collect
it).

Note:
   - the array (v) has 1000 elements

   - the "die" has 1000 sides
  
   - the mean linked list will have 1000 elements in it

There are no circular references in the data structure.

I wanted this very same speed-test to work with Giggle's
garbage collectors & with the usual memory management
schemes that did not use garbage collection at all so I
could compare them.  So the loop (which is a function unto
itself) has arguments that are pointers to functions to do
all the things it needs: make_node, free_list, and
clear_ptr.
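
Roughly, a sketch of that loop in C++ might look like the
following.  This is not the literal code in src/Speed.cxx;
the names, signatures, & the role of clear_ptr are guesses
based on the description above:

  #include <cstddef>
  #include <cstdlib>
  #include <ctime>
  #include <vector>

  struct Node { Node *next; };
  typedef Node *NodePtr;

  typedef NodePtr (*make_node_fn) ();
  typedef void (*free_list_fn) (NodePtr);
  typedef void (*clear_ptr_fn) (NodePtr &);

  // Run the seaweed loop for 'seconds' seconds & return how many
  // Nodes were created.  The memory-management strategy is plugged
  // in through the three function pointers.
  unsigned long
  speed_loop (unsigned seconds, make_node_fn make_node,
              free_list_fn free_list, clear_ptr_fn clear_ptr)
  {
    std::vector<NodePtr> v (1000, (NodePtr) 0);
    unsigned long count = 0;
    const std::time_t stop = std::time (0) + seconds;
    while (std::time (0) < stop) {
      const std::size_t i = std::rand () % v.size ();
      NodePtr node = make_node ();
      if (std::rand () % 1000 == 0) {   // the 1000-sided die came up 1
        free_list (v[i]);               // chop off the old list
        clear_ptr (v[i]);               // clear_ptr's exact role is a guess
        node->next = 0;                 // start a fresh list of length 1
      } else {
        node->next = v[i];              // extend the existing list
      }
      v[i] = node;
      ++count;
    }
    return count;
  }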

When using the loop with a Giggle garbage collector,
make_node does "new (gc ()) Node" and free_list does
nothing.

With raw Boehm, make_node allocates space for a Node from
the Boehm collector, & free_list does nothing.  (Also
explicitly calls Node's constructor & destructor.)

With malloc/free, make_node allocates with malloc and
free_list frees each node in the linked list using 'free'.
(Also explicitly calls Node's constructor & destructor.)

With new/delete, make_node uses "new Node" and free_list
deletes each node in the linked list.
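
As a sketch (again, not the literal code in the generated
test programs; the names here are mine, & it reuses the Node
& NodePtr types from the sketch above), the malloc/free &
raw-Boehm callbacks might look roughly like this:

  #include <cstdlib>
  #include <new>
  #include <gc.h>    // Boehm collector header; location varies by install

  // malloc/free variant: explicit allocation & deallocation.
  NodePtr make_node_malloc ()
  {
    void *raw = std::malloc (sizeof (Node));
    return new (raw) Node;        // placement new runs the constructor
  }

  void free_list_malloc (NodePtr head)
  {
    while (head) {
      NodePtr next = head->next;
      head->~Node ();             // explicit destructor call
      std::free (head);
      head = next;
    }
  }

  // raw Boehm variant: allocate from the collector, never free.
  NodePtr make_node_boehm ()
  {
    return new (GC_malloc (sizeof (Node))) Node;
  }

  void free_list_boehm (NodePtr)
  {
    // nothing to do; Boehm reclaims unreachable nodes on its own
  }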

I needed to create a different speed-test program for each
Giggle garbage collector, so I wrote a program to write the
test programs.  Actually, all those speed-test programs use
the same innards (src/Speed.cxx), but the program-writing
program just writes the 'main' for each type of Giggle
garbage collector.  (There are 27, if I remember.)  I didn't
list all the types of garbage collectors for the
program-writing program; there are too many for that.
Instead, I gave it the list of parts, & it combined them
exhaustively to produce the complete list of garbage
collectors.  (The program-writing program is
src/create-tests.sh.)

HOW THE RESULTS ARE DISPLAYED

The columns are:

- iterations : number of times through the loop;
  equivalently, the number of Nodes that were created

- time : number of seconds the entire program ran (measured
  by speed.sh, which got the "date +%s" before it ran the
  speed-test and right after the speed-test finished).  When
  it's greater than 300 seconds, that's because the
  speed-test program probably looped for 300 seconds, but
  there was a cost to shutting down the program.

- rate : iterations per second, computed as (iterations /
  time).  The rows are sorted from highest rate to lowest.

- program : the name of the program file, in case it's ever
  needed

- technique : the memory management technique; see the next
  section

THE GARBAGE COLLECTOR TECHNIQUES

The techniques are:

- raw Boehm : called GC_malloc to create nodes, didn't do
  anything to de-allocate them; let Boehm do that.
  (Explicitly called Node's constructor & destructor.)

- malloc/free : explicitly created & destroyed each node
  using malloc & free.  (Explicitly called Node's
  constructor & destructor.)

- new/delete : explicitly created & destroyed each Node with
  C++'s built-in "new" and "delete".

- ReferenceCounterT : It's a parameterized Giggle garbage
  collector.  In order, the parameters are:

  - Bookkeeping : All Giggle garbage collectors keep
    information about data blocks.  Currently, I've
    implemented one data structure they all use to keep
    that information.  That's the Bookkeeping data
    structure.  To optimize memory use, I could implement
    data structures that are specific to certain garbage
    collectors, but it wouldn't save much space, I don't
    think.

  - OffsetBookkeepingMap : A Giggle garbage collector must
    map (void *) to the bookkeeping data.  I implemented a
    couple such maps.  It turns out the fastest, in all
    cases, is OffsetBookkeepingMap.  It assumes the
    bookkeeping data was allocated at the front of the block
    the client requested.  So given the pointer to the block
    from the client's point of view, the address of the
    bookkeeping data is obtained by subtracting an offset
    from that pointer.  (See the sketch after this list.)

  - RecyclingAllocator : A C++ allocator that recycles data
    blocks.  When the garbage collector tells the
    RecyclingAllocator to free a block of memory, the
    RecyclingAllocator places it in a free list.  When the
    garbage collector asks for a new block, the
    RecyclingAllocator gives it one from the free list if
    the free list isn't empty.  This RecyclingAllocator is
    very naive, keeping a free list for each block size up
    to 32 KB.  Even so, it gives good performance.

  - MallocAllocator : A C++ allocator that uses
    malloc/free.  Yup, it turns out to be much faster than
    the std::allocator that comes with Gnu GCC 3.0.4.

- ReferenceCounter : A Giggle garbage collector that
  instantiates a ReferenceCounterT with OffsetBookkeepingMap
  and MallocAllocator inside its .cxx file to avoid code
  bloat.
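
To illustrate the OffsetBookkeepingMap trick mentioned in
the list above, here is a sketch of the idea only, not
Giggle's actual classes (the Bookkeeping fields here are
made up):

  #include <cstddef>
  #include <cstdlib>

  struct Bookkeeping {            // stand-in for Giggle's real record
    std::size_t size;
    unsigned references;
  };

  // Hand the client a block whose Bookkeeping record sits just in
  // front of the bytes the client sees.
  void *allocate_with_bookkeeping (std::size_t n)
  {
    void *raw = std::malloc (sizeof (Bookkeeping) + n);
    Bookkeeping *bk = static_cast<Bookkeeping *> (raw);
    bk->size = n;
    bk->references = 0;
    return static_cast<char *> (raw) + sizeof (Bookkeeping);
  }

  // Map a client pointer back to its bookkeeping data with a single
  // subtraction -- no searching & no separate lookup table.
  Bookkeeping *bookkeeping_of (void *client)
  {
    return reinterpret_cast<Bookkeeping *>
      (static_cast<char *> (client) - sizeof (Bookkeeping));
  }

A real implementation also has to worry about alignment &
about recycling freed blocks (which is where the
RecyclingAllocator's per-size free lists come in), but that
single pointer subtraction is why this map is the fastest.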

WHAT THE RESULTS MEAN

Notice that raw Boehm is the fastest by a factor of about 2.25.
This is not a mistake.  Garbage collection is not the
necessary bottleneck that people assume it is.  In these
tests, though, the garbage collector has maximum advantage:

  - all objects are of the same size (struct Node)

  - there are no cycles

Then again, the manual/explicit techniques have the same
advantage.  Heck, you'd think they'd have a greater
advantage because "free_list" for them is just walking
through a list & freeing the nodes as it goes.  So I take it
back: The garbage collector does not have an advantage in
this test.

I'm sure my naive RecyclingAllocator was able to take
advantage of the "all blocks are the same size" feature of
the test.

Notice that malloc/free is faster than new/delete.  This
coincides with some observations I've made with my unit test
programs and my MallocAllocator (malloc/free) and
NewAllocator (new/delete).

It looks to me like the Boehm collector might be the best
bet in the general case.  If you want the freedom to
experiment with different garbage collection algorithms,
maybe even to create a garbage collector that is optimal for
your application's domain, or you want to mix & match
garbage collection with explicit memory management, use
Giggle's ReferenceCounter.

Giggle's ReferenceCounter has greater than 1/8th the speed
of explicit malloc/free, & the raw Boehm collector has about
2.25 times the speed of malloc/free.

ONE MORE NOTE ABOUT THE TESTS

When the speed-test program worked only for Giggle's garbage
collectors, it had one more feature.

In the current speed test, there are two outcomes when the
die is rolled:

   - die == 1 : chop off a current list & start a new one

   - die != 1 : extend a current list

There was originally a third option:

   - die == 2 : chop the current list, then let v[i] =
     v[rand () % length], extend the new v[i]

The third option caused an existing list to be chopped & to
be replaced by another existing list, then extended.  So the
result was that the array 'v' held growing trees (upside
down, with their leaves "rooted" in 'v').

When I modified the speed-test to work with explicit
allocation (malloc/free, new/delete), I couldn't do this any
more because, when a list was chopped & sent to 'free_list',
how would free_list know which nodes in its singly linked
list were in that list only, & needed deleting, & which were
shared with another list?  It couldn't know at all unless I
gave it more information.  I could have added information to
struct Node & a little processing.  I could have given
'free_list' the list to free & a pointer to the entire
array, but then it would have had to do non-trivial scanning
of the array to find which elements were shared & should not
be deleted.  That would have been buggy, & it would have
been ... implementing garbage collection for a specific
case.  What's the point?
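
To make the problem concrete, here is a small sketch
(reusing the hypothetical make_node_malloc & free_list_malloc
from the earlier sketch) of two lists that share a tail:

  // Two lists that end up sharing a tail:
  //
  //   a -> s1 -> s2 -> 0
  //   b ----^
  //
  void shared_tail_demo ()
  {
    NodePtr s2 = make_node_malloc ();  s2->next = 0;
    NodePtr s1 = make_node_malloc ();  s1->next = s2;
    NodePtr a  = make_node_malloc ();  a->next  = s1;
    NodePtr b  = make_node_malloc ();  b->next  = s1;

    free_list_malloc (a);   // frees a, s1, & s2 ...
    // ... so b->next now dangles, & a second call,
    // free_list_malloc (b), would free s1 & s2 again.
    // A reference counter or tracing collector handles the
    // shared tail without any of this bookkeeping.
    (void) b;
  }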

So I had to remove that part of the speed test to accommodate
manual memory management.

Notice that the older, full version of the speed-test did
not have a complex data structure.  It was a tree: a useful
data structure with no cycles.  It was a pretty simple data
structure, in fact.  Easy for any garbage collector to
handle.  Even reference counting had no trouble with it, but
manual/explicit memory management was not adequate for the
task.

How many other data structures & algorithms are effectively
unmanageable & infeasible because people don't use garbage
collection?



Gene Michael Stover
2002-04-28