From gms@sdf.lonestar.org Wed Apr 17 19:16:05 2002
Date: Wed, 17 Apr 2002 17:49:20 GMT
Message-Id: <200204171749.g3HHnKp02961@sdf.lonestar.org>
From: Gene Michael Stover <gms@sdf.lonestar.org>
To:
Subject: Garbage collection performance
Hi All,
I'll be making a new release of Giggle soon (today, maybe).
Here are the results of some performance comparisons I did.
Veeeeeeeeeery interesting... even for people who don't
program in C++.
If you don't have the energy to read the whole thing (it's
long), take a peek at the results table, then skip to the
final section ("ONE MORE NOTE ABOUT THE TESTS") where I
relate a small case study about programmer productivity &
garbage collection; the case study occurred while I was
writing the speed-test programs.
I hope y'all enjoy it.
--------
Giggle... well basically, Giggle rocks, & I'm a genius. I
now use "composition" to create garbage collectors. Some of
the composition is done at run-time & some at compile-time.
To the C++ programmer, that means I use templates &
... well, composition (since there's no C++-specific term
for run-time composition).
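For a rough picture of the compile-time half, here's a
minimal sketch of a collector parameterized by its
allocator. All of the names here are invented for
illustration; they are not Giggle's real interfaces.

```cpp
#include <cstdlib>

// Sketch of compile-time composition: the collector takes its allocator as
// a template parameter, so swapping allocators means changing a template
// argument, not editing the collector.  Invented names, for illustration.
struct MallocAllocator {
    static void *allocate(std::size_t n) { return std::malloc(n); }
    static void deallocate(void *p)      { std::free(p); }
};

template <class Allocator>
class CollectorT {
public:
    void *allocate(std::size_t n) { return Allocator::allocate(n); }
    void release(void *p)         { Allocator::deallocate(p); }
};
```

The run-time half would instead hold a pointer to an
allocator object & dispatch through it.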
I'll tell more about what I did momentarily, but first, here
are the results from a speed test. Sorted from fastest to
slowest (& taking only the top 6 fastest performers) we have:
iterations time rate program technique
458157788 300 1527192.62 bin/speed0012 raw Boehm
205190888 303 677197.65 bin/speed0011 malloc/free
201504008 303 665029.72 bin/speed0010 new/delete
30771956 303 101557.61 bin/speed1012 CyberTiggyr::giggle::ReferenceCounterT< CyberTiggyr::giggle::Bookkeeping, CyberTiggyr::giggle::OffsetBookkeepingMap<CyberTiggyr::giggle::Bookkeeping>, CyberTiggyr::giggle::RecyclingAllocator<char> >
29982470 310 96717.64 bin/speed1014 CyberTiggyr::giggle::ReferenceCounterT< CyberTiggyr::giggle::Bookkeeping, CyberTiggyr::giggle::OffsetBookkeepingMap<CyberTiggyr::giggle::Bookkeeping>, CyberTiggyr::giggle::MallocAllocator<char> >
27869853 309 90193.69 bin/speed1002 CyberTiggyr::giggle::ReferenceCounter
Notice that the top two performers are not specific to C++.
(Yes, they _did_ call constructors & destructors.) (The
Boehm collector is a small C library. Search Google for
"boehm collector", & you'll find it. I use it for C all
time nowadays.)
WHAT THE TEST IS
To understand the results, you must understand the test.
Each test is a separate program. A driver program
(speed.sh) runs each one & times it from launch to finish.
(I originally had the speed-test programs time themselves,
but some garbage collection techniques (Boehm) do their work
after 'main' exits, so I needed to time each process as a
whole, including its shutdown.)
A test is a big loop. I chose it to represent a common,
non-trivial type of memory use an app might do, but to be
simple enough for me to implement as a speed-test. Here's
what the loop does (pseudocode):
- A Node is a basic linked-list node:
struct Node { Node *next; };
- Make an array of Node*, call it 'v'. (In the speed-test
programs, it's actually a std::vector<NodePtr>.)
- Until time runs out (5 minutes):
- Let i be a randomly selected index in v.
- Make a new Node.
- Roll a die:
If die == 1, forget about the linked list that is v[i]
(garbage collector might collect it),
Otherwise, Node.next = v[i] (so we're extending the
linked list that is v[i]).
- v[i] = Node. (So we're putting the new list in v[i].
If die == 1, then we're putting an entirely new list of
length 1 there, & we'll probably grow it in later
iterations. If die != 1, then we've extended the list
that was already there.)
- Return the number of times we made it through the loop.
So the test makes a bunch of seaweed (linked lists) that
grow (one Node at a time), & sometimes a stalk of seaweed is
chopped off (forgotten so the garbage collector can collect
it).
Note:
- the array (v) has 1000 elements
- the "die" has 1000 sides
- the mean linked list will have 1000 elements in it
There are no circular references in the data structure.
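For the curious, the loop above can be sketched as real
C++. This is a hedged approximation, using plain new/delete
& a fixed iteration count in place of the five-minute wall
clock; the names are mine, not the speed-test program's.

```cpp
#include <cstdlib>
#include <vector>

// The basic linked-list node from the pseudocode.
struct Node { Node *next; };

// Delete every node in the list rooted at p.
static void free_list(Node *p) {
    while (p) { Node *next = p->next; delete p; p = next; }
}

// One pass of the speed-test loop: grow "seaweed" in v, occasionally
// chopping a stalk off.  Returns the number of iterations (i.e. Nodes made).
unsigned long run_loop(unsigned long iterations, std::size_t slots,
                       int die_sides) {
    std::vector<Node *> v(slots, nullptr);
    unsigned long count = 0;
    for (; count < iterations; ++count) {
        std::size_t i = std::rand() % slots;   // random index in v
        Node *node = new Node{nullptr};
        if (std::rand() % die_sides == 0)
            free_list(v[i]);                   // chop off the old list
        else
            node->next = v[i];                 // extend the old list
        v[i] = node;                           // new list head goes in v[i]
    }
    for (Node *p : v) free_list(p);            // clean up at the end
    return count;
}
```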
I wanted this very same speed-test to work with Giggle's
garbage collectors & with the usual memory management
schemes that did not use garbage collection at all so I
could compare them. So the loop (which is a function unto
itself) has arguments that are pointers to functions to do
all the things it needs: make_node, free_list, and
clear_ptr.
When using the loop with a Giggle garbage collector,
make_node does "new (gc ()) Node" and free_list does
nothing.
With raw Boehm, make_node allocates space for a Node from
the Boehm collector, & free_list does nothing. (Also
explicitly calls Node's constructor & destructor.)
With malloc/free, make_node allocates with malloc and
free_list frees each node in the linked list using 'free'.
(Also explicitly calls Node's constructor & destructor.)
With new/delete, make_node uses "new Node" and free_list
deletes each node in the linked list.
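The malloc/free binding, with its explicit constructor &
destructor calls, might look something like this sketch
(the names & signatures are guesses, not the actual
src/Speed.cxx code):

```cpp
#include <cstdlib>
#include <new>

struct Node { Node *next; };

// The loop receives its memory operations as function pointers, so one
// loop serves every technique.  Illustrative typedefs:
typedef Node *(*MakeNode)();
typedef void (*FreeList)(Node *);

// malloc/free binding: allocate raw bytes, then explicitly run the
// constructor with placement new.
static Node *make_node_malloc() {
    void *raw = std::malloc(sizeof(Node));
    return new (raw) Node{nullptr};
}

// Walk the list, explicitly running each destructor before freeing.
static void free_list_malloc(Node *p) {
    while (p) {
        Node *next = p->next;
        p->~Node();
        std::free(p);
        p = next;
    }
}
```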
I needed to create a different speed-test program for each
Giggle garbage collector, so I wrote a program to write the
test programs. Actually, all those speed-test programs use
the same innards (src/Speed.cxx), but the program-writing
program just writes the 'main' for each type of Giggle
garbage collector. (There are 27, if I remember.) I didn't
list all the types of garbage collectors for the
program-writing program; there are too many for that.
Instead, I gave it the list of parts, & it combined them
exhaustively to produce the complete list of garbage
collectors. (The program-writing program is
src/create-tests.sh.)
HOW THE RESULTS ARE DISPLAYED
The columns are:
- iterations : number of times through the loop;
equivalently, the number of Nodes that were created
- time : number of seconds the entire program ran (measured
by speed.sh, which got the "date +%s" before it ran the
speed-test and right after the speed-test finished). When
it's greater than 300 seconds, that's because the
speed-test program probably looped for 300 seconds, but
there was a cost to shut down the program.
- rate : iterations per second, computed as (iterations /
time). The rows are sorted from highest rate to least.
- program : the name of the program file, in case it's ever
needed
- technique : the memory management technique; see the next
section
THE GARBAGE COLLECTOR TECHNIQUES
The techniques are:
- raw Boehm : called GC_malloc to create nodes, didn't do
anything to de-allocate them; let Boehm do that.
(Explicitly called Node's constructor & destructor.)
- malloc/free : explicitly created & destroyed each node
using malloc & free. (Explicitly called Node's
constructor & destructor.)
- new/delete : explicitly created & destroyed each Node with
C++'s built-in "new" and "delete".
- ReferenceCounterT : It's a parameterized Giggle garbage
collector. In order, the parameters are:
- Bookkeeping : All Giggle garbage collectors keep
information about data blocks. Currently, I've
implemented one data structure they all use to keep
that information. That's the Bookkeeping data
structure. To optimize memory use, I could implement
data structures that are specific to certain garbage
collectors, but it wouldn't save much space, I don't
think.
- OffsetBookkeepingMap : A Giggle garbage collector must
map (void *) to the bookkeeping data. I implemented a
couple such maps. It turns out the fastest, in all
cases, is OffsetBookkeepingMap. It assumes the
bookkeeping data was allocated at the front of the block
the client requested. So given the pointer to the block
from the client's point of view, the address of the
bookkeeping data is obtained by subtracting an offset
from that pointer.
- RecyclingAllocator : A C++ allocator that recycles data
blocks. When the garbage collector tells the
RecyclingAllocator to free a block of memory, the
RecyclingAllocator places it in a free list. When the
garbage collector asks for a new block, the
RecyclingAllocator gives it one from the free list if
the free list isn't empty. This RecyclingAllocator is
very naive, keeping a free list for each block size up
to 32 KB. Even so, it gives good performance.
- MallocAllocator : A C++ allocator that uses
malloc/free. Yup, it turns out to be much faster than
the std::allocator that comes with Gnu GCC 3.0.4.
- ReferenceCounter : A Giggle garbage collector that
instantiates a ReferenceCounterT with OffsetBookkeepingMap
and MallocAllocator inside its .cxx file to avoid code
bloat.
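To illustrate the OffsetBookkeepingMap idea, here's a
sketch of the "bookkeeping at the front of the block"
layout. The Bookkeeping fields are invented; only the
offset trick comes from the description above.

```cpp
#include <cstddef>
#include <cstdlib>

// Invented bookkeeping fields; the real Giggle structure may differ.
struct Bookkeeping {
    std::size_t ref_count;
    std::size_t size;
};

// Allocate size bytes for the client, with bookkeeping prepended.
void *gc_alloc(std::size_t size) {
    char *raw = static_cast<char *>(std::malloc(sizeof(Bookkeeping) + size));
    Bookkeeping *bk = reinterpret_cast<Bookkeeping *>(raw);
    bk->ref_count = 1;
    bk->size = size;
    return raw + sizeof(Bookkeeping);      // client sees only its block
}

// Map a client pointer back to its bookkeeping by subtracting an offset --
// no searching, just pointer arithmetic.
Bookkeeping *bookkeeping_of(void *client) {
    return reinterpret_cast<Bookkeeping *>(
        static_cast<char *>(client) - sizeof(Bookkeeping));
}

// Release the whole block, bookkeeping included.
void gc_free(void *client) {
    std::free(bookkeeping_of(client));
}
```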
WHAT THE RESULTS MEAN
Notice that raw Boehm is the fastest, by a factor of about 2.3.
This is not a mistake. Garbage collection is not the
necessary bottleneck that people assume it is. In these
tests, though, the garbage collector has maximum advantage:
- all objects are of the same size (struct Node)
- there are no cycles
Then again, the manual/explicit techniques have the same
advantage. Heck, you'd think they'd have a greater
advantage because "free_list" for them is just walking
through a list & freeing the nodes as it goes. So I take it
back: The garbage collector does not have an advantage in
this test.
I'm sure my naive RecyclingAllocator was able to take
advantage of the "all blocks are the same size" feature of
the test.
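In spirit, the recycling works like this stripped-down
sketch. The real RecyclingAllocator is a proper C++
allocator; this stand-in shows only the per-size free
lists.

```cpp
#include <cstdlib>
#include <vector>

// Naive recycling allocator sketch: one free list per block size, up to
// 32 KB.  Freed blocks go on a list; later allocations of the same size
// reuse them instead of calling malloc again.
class Recycler {
    static const std::size_t kMaxSize = 32 * 1024;
    std::vector<std::vector<void *> > free_lists_;  // indexed by size
public:
    Recycler() : free_lists_(kMaxSize + 1) {}
    void *allocate(std::size_t size) {
        if (size <= kMaxSize && !free_lists_[size].empty()) {
            void *p = free_lists_[size].back();     // reuse a recycled block
            free_lists_[size].pop_back();
            return p;
        }
        return std::malloc(size);
    }
    void deallocate(void *p, std::size_t size) {
        if (size <= kMaxSize)
            free_lists_[size].push_back(p);         // recycle, don't free
        else
            std::free(p);
    }
    ~Recycler() {                                   // free recycled blocks
        for (std::size_t s = 0; s <= kMaxSize; ++s)
            for (std::size_t k = 0; k < free_lists_[s].size(); ++k)
                std::free(free_lists_[s][k]);
    }
};
```

Because every block in the speed-test is sizeof(Node), one
free list absorbs nearly all the traffic.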
Notice that malloc/free is faster than new/delete. This
coincides with some observations I've made with my unit test
programs and my MallocAllocator (malloc/free) and
NewAllocator (new/delete).
It looks to me like the Boehm collector might be the best
bet in the general case. If you want to have the freedom to
experiment with different garbage collection algorithms,
maybe even creating a garbage collector that is optimal for
your application's domain, or you want to mix & match
garbage collection with explicit memory management, use
Giggle's ReferenceCounter.
Giggle's ReferenceCounter has greater than 1/8th the speed
of explicit malloc/free, & the raw Boehm collector has about
2.3 times the speed of malloc/free.
ONE MORE NOTE ABOUT THE TESTS
When the speed-test program worked only for Giggle's garbage
collectors, it had one more feature.
In the current speed test, there are two outcomes when the
die is rolled:
- die == 1 : chop-off a current list & start a new one
- die != 1 : extend a current list
There was originally a third option:
- die == 2 : chop the current list, then let v[i] =
v[rand () % length], extend the new v[i]
The third option caused an existing list to be chopped & to
be replaced by another existing list, then extended. So the
result was that the array 'v' held growing trees (upside
down, with their leaves "rooted" in 'v').
When I modified the speed-test to work with explicit
allocation (malloc/free, new/delete), I couldn't do this any
more because, when a list was chopped & sent to 'free_list',
how would free_list know which nodes in its singly linked
list were in that list only, & needed deleting, & which were
shared with another list? It couldn't know at all unless I
gave it more information. I could have added information to
struct Node & a little processing. I could have given
'free_list' the list to free & a pointer to the entire
array, but then it would have had to do non-trivial scanning
of the array to find which elements were shared & should not
be deleted. That would have been buggy, & it would have
been ... implementing garbage collection for a specific
case. What's the point?
So I had to remove that part of the speed test to accommodate
manual memory management.
Notice that the older, full version of the speed-test did
not have a complex data structure. It was a tree: a useful
data structure with no cycles. It was a pretty simple data
structure, in fact. Easy for any garbage collector to
handle. Even reference counting had no trouble with it, but
manual/explicit memory management was not adequate for the
task.
How many other data structures & algorithms are effectively
unmanageable & infeasible because people don't use garbage
collection?