From gms@sdf.lonestar.org Wed Apr 17 19:16:05 2002
Date: Wed, 17 Apr 2002 17:49:20 GMT
Message-Id: <200204171749.g3HHnKp02961@sdf.lonestar.org>
From: Gene Michael Stover <gms@sdf.lonestar.org>
Subject: Garbage collection performance

Hi All,

I'll be making a new release of Giggle soon (today, maybe). Here are the results of some performance comparisons I did. Veeeeeeeeeery interesting... even for people who don't program in C++.

If you don't have the energy to read the whole thing (it's long), take a peek at the results table, then skip to the final section ("ONE MORE NOTE ABOUT THE TESTS"), where I relate a small case study about programmer productivity & garbage collection; the case study occurred while I was writing the speed-test programs.

I hope y'all enjoy it.

--------

Giggle... well, basically, Giggle rocks, & I'm a genius. I now use "composition" to create garbage collectors.
Some of the composition is done at run-time & some at compile-time. To the C++ programmer, that means I use templates & ... well, composition (since there's no C++-specific term for run-time composition). I'll tell more about what I did momentarily, but first, here are the results from a speed test. Sorted from fastest to slowest (& taking only the top 6 performers), we have:

 iterations  time        rate  program        technique
  458157788   300  1527192.62  bin/speed0012  raw Boehm
  205190888   303   677197.65  bin/speed0011  malloc/free
  201504008   303   665029.72  bin/speed0010  new/delete
   30771956   303   101557.61  bin/speed1012  CyberTiggyr::giggle::ReferenceCounterT<
                                                CyberTiggyr::giggle::Bookkeeping,
                                                CyberTiggyr::giggle::OffsetBookkeepingMap<CyberTiggyr::giggle::Bookkeeping>,
                                                CyberTiggyr::giggle::RecyclingAllocator<char> >
   29982470   310    96717.64  bin/speed1014  CyberTiggyr::giggle::ReferenceCounterT<
                                                CyberTiggyr::giggle::Bookkeeping,
                                                CyberTiggyr::giggle::OffsetBookkeepingMap<CyberTiggyr::giggle::Bookkeeping>,
                                                CyberTiggyr::giggle::MallocAllocator<char> >
   27869853   309    90193.69  bin/speed1002  CyberTiggyr::giggle::ReferenceCounter

Notice that the top two performers are not specific to C++. (Yes, they _did_ call constructors & destructors.) (The Boehm collector is a small C library. Search Google for "boehm collector", & you'll find it. I use it for C all the time nowadays.)

WHAT THE TEST IS

To understand the results, you must understand the test. Each test is a separate program. A driver program (speed.sh) runs each one & times it from launch to finish. (I originally had the speed-test programs time themselves, but some garbage collection techniques (Boehm) do their work after 'main' exits, so I needed to time the process as a whole, which can take a while to exit.)

A test is a big loop. I chose it to represent a common, non-trivial type of memory use an app might do, but to be simple enough for me to implement as a speed-test.
Here's what the loop does (pseudocode):

- A Node is a basic linked-list node: struct Node { Node *next; };
- Make an array of Node*, call it 'v'. (In the speed-test programs, it's actually a std::vector<NodePtr>.)
- Until time runs out (5 minutes):
  - Let i be a randomly selected index in v.
  - Make a new Node.
  - Roll a die:
    - If die == 1, forget about the linked list that is v[i] (garbage collector might collect it).
    - Otherwise, Node.next = v[i] (so we're extending the linked list that is v[i]).
  - v[i] = Node. (So we're putting the new list in v[i]. If die == 1, then we're putting an entirely new list of length 1 there, & we'll probably grow it in later iterations. If die != 1, then we've extended the list that was already there.)
- Return the number of times we made it through the loop.

So the test makes a bunch of seaweed (linked lists) that grow (one Node at a time), & sometimes a stalk of seaweed is chopped off (forgotten so the garbage collector can collect it). Note:

- the array (v) has 1000 elements
- the "die" has 1000 sides
- the mean linked list will have 1000 elements in it

There are no circular references in the data structure.

I wanted this very same speed-test to work with Giggle's garbage collectors & with the usual memory management schemes that did not use garbage collection at all, so I could compare them. So the loop (which is a function unto itself) has arguments that are pointers to functions to do all the things it needs: make_node, free_list, and clear_ptr.

When using the loop with a Giggle garbage collector, make_node does "new (gc ()) Node" and free_list does nothing.

With raw Boehm, make_node allocates space for a Node from the Boehm collector, & free_list does nothing. (It also explicitly calls Node's constructor & destructor.)

With malloc/free, make_node allocates with malloc, and free_list frees each node in the linked list using 'free'. (It also explicitly calls Node's constructor & destructor.)
With new/delete, make_node uses "new Node" and free_list deletes each node in the linked list.

I needed to create a different speed-test program for each Giggle garbage collector, so I wrote a program to write the test programs. Actually, all those speed-test programs use the same innards (src/Speed.cxx); the program-writing program just writes the 'main' for each type of Giggle garbage collector. (There are 27, if I remember correctly.) I didn't list all the types of garbage collectors for the program-writing program; there are too many for that. Instead, I gave it the list of parts, & it combined them exhaustively to produce the complete list of garbage collectors. (The program-writing program is src/create-tests.sh.)

HOW THE RESULTS ARE DISPLAYED

The columns are:

- iterations : number of times through the loop; equivalently, the number of Nodes that were created
- time : number of seconds the entire program ran (measured by speed.sh, which got the "date +%s" just before it ran the speed-test and again right after the speed-test finished). When it's greater than 300 seconds, that's because the speed-test program probably looped for 300 seconds, but there was a cost to shut down the program.
- rate : iterations per second, computed as (iterations / time). The rows are sorted from highest rate to lowest.
- program : the name of the program file, in case it's ever needed
- technique : the memory management technique; see the next section

THE GARBAGE COLLECTOR TECHNIQUES

The techniques are:

- raw Boehm : called GC_malloc to create nodes; didn't do anything to de-allocate them; let Boehm do that. (Explicitly called Node's constructor & destructor.)
- malloc/free : explicitly created & destroyed each node using malloc & free. (Explicitly called Node's constructor & destructor.)
- new/delete : explicitly created & destroyed each Node with C++'s built-in "new" and "delete".
- ReferenceCounterT : a parameterized Giggle garbage collector.
In order, the parameters are:

- Bookkeeping : All Giggle garbage collectors keep information about data blocks. Currently, I've implemented one data structure they all use to keep that information. That's the Bookkeeping data structure. To optimize memory use, I could implement data structures that are specific to certain garbage collectors, but it wouldn't save much space, I don't think.

- OffsetBookkeepingMap : A Giggle garbage collector must map a (void *) to its bookkeeping data. I implemented a couple of such maps. It turns out the fastest, in all cases, is OffsetBookkeepingMap. It assumes the bookkeeping data was allocated at the front of the block the client requested, so, given the pointer to the block from the client's point of view, the address of the bookkeeping data is obtained by subtracting an offset from that pointer.

- RecyclingAllocator : A C++ allocator that recycles data blocks. When the garbage collector tells the RecyclingAllocator to free a block of memory, the RecyclingAllocator places it in a free list. When the garbage collector asks for a new block, the RecyclingAllocator gives it one from the free list if the free list isn't empty. This RecyclingAllocator is very naive, keeping a free list for each block size up to 32 KB. Even so, it gives good performance.

- MallocAllocator : A C++ allocator that uses malloc/free. Yup, it turns out to be much faster than the std::allocator that comes with GNU GCC 3.0.4.

- ReferenceCounter : A Giggle garbage collector that instantiates a ReferenceCounterT with OffsetBookkeepingMap and MallocAllocator inside its .cxx file to avoid code bloat.

WHAT THE RESULTS MEAN

Notice that raw Boehm is the fastest, by a factor of about 2.3. This is not a mistake. Garbage collection is not the necessary bottleneck that people assume it is.
In these tests, though, the garbage collector has maximum advantage:

- all objects are of the same size (struct Node)
- there are no cycles

Then again, the manual/explicit techniques have the same advantages. Heck, you'd think they'd have a greater advantage, because "free_list" for them is just walking through a list & freeing the nodes as it goes. So I take it back: the garbage collector does not have an advantage in this test. I'm sure my naive RecyclingAllocator was able to take advantage of the "all blocks are the same size" feature of the test.

Notice that malloc/free is faster than new/delete. This coincides with some observations I've made with my unit test programs and my MallocAllocator (malloc/free) and NewAllocator (new/delete).

It looks to me like the Boehm collector might be the best bet in the general case. If you want the freedom to experiment with different garbage collection algorithms, maybe even creating a garbage collector that is optimal for your application's domain, or you want to mix & match garbage collection with explicit memory management, use Giggle's ReferenceCounter. Giggle's ReferenceCounter has greater than 1/8th the speed of explicit malloc/free, & the raw Boehm collector has about 2.3 times the speed of malloc/free.

ONE MORE NOTE ABOUT THE TESTS

When the speed-test program worked only for Giggle's garbage collectors, it had one more feature. In the current speed test, there are two outcomes when the die is rolled:

- die == 1 : chop off the current list & start a new one
- die != 1 : extend the current list

There was originally a third option:

- die == 2 : chop the current list, then let v[i] = v[rand () % length], & extend the new v[i]

The third option caused an existing list to be chopped & to be replaced by another existing list, which was then extended. So the result was that the array 'v' held growing trees (upside down, with their leaves "rooted" in 'v').
When I modified the speed-test to work with explicit allocation (malloc/free, new/delete), I couldn't do this any more, because when a list was chopped & sent to 'free_list', how would free_list know which nodes in its singly linked list were in that list only & needed deleting, & which were shared with another list? It couldn't know at all unless I gave it more information. I could have added information to struct Node & a little processing. I could have given 'free_list' the list to free & a pointer to the entire array, but then it would have had to do non-trivial scanning of the array to find which elements were shared & should not be deleted. That would have been buggy, & it would have been ... implementing garbage collection for a specific case. What's the point? So I had to remove that part of the speed test to accommodate manual memory management.

Notice that the older, full version of the speed-test did not have a complex data structure. It was a tree: a useful data structure with no cycles. It was a pretty simple data structure, in fact. Easy for any garbage collector to handle. Even reference counting had no trouble with it, but manual/explicit memory management was not adequate for the task.

How many other data structures & algorithms are effectively unmanageable & infeasible because people don't use garbage collection?