C# Using GC Efficiently

Using GC Efficiently – Part 1

So the goal here is to explain the cost of things so you can make good decisions in your managed memory usage – it’s not to explain GC itself, it’s to explain how to use it. I assume most of you are more interested in using a garbage collector than in implementing one yourself. This series assumes a basic understanding of GC. Jeff Richter wrote 2 excellent MSDN articles on GC that I would recommend if you need some background: 1 and 2.

 

First I’ll focus on Wks GC (so all the numbers are for Wks GC). Then I’ll talk about what’s different for Svr GC and when you should use which (sometimes you don’t necessarily have the choice, and I’ll explain why).

 

Generations

 

The reason behind having 3 generations is that we expect that, for a well tuned app, most objects die in Gen0. For example, in a server app, the allocations associated with each request should die after the request is finished. The in-flight allocation requests will make it into Gen1 and die there. Essentially Gen1 acts as a buffer between young object areas and long lived object areas. When you look at the number of collections in perfmon, you want to see a low ratio of Gen2 collections to Gen0 collections. The number of Gen1 collections is relatively unimportant – collecting Gen1 is not much more expensive than collecting Gen0.
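If you'd rather check this from code than from perfmon, here's a minimal sketch (note that GC.CollectionCount is only available starting with .NET 2.0/Whidbey):

// Watch the ratio of Gen2 collections to Gen0 collections.
// Lower is better.
int gen0Count = GC.CollectionCount(0);
int gen2Count = GC.CollectionCount(GC.MaxGeneration);
Console.WriteLine("Gen2/Gen0 collections: {0}/{1}", gen2Count, gen0Count);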

 

GC segments

 

First let’s see how GC gets memory from the OS. GC reserves memory in segments. Each segment is 16 MB. When the EE (Execution Engine) starts, we reserve the initial GC segments - one for the small object heap and the other for the LOH (large object heap).

 

The memory is committed and decommitted as needed. When we run out of room on the existing segments, we reserve a new one. During each full collection, any segment that’s no longer in use is deleted.

 

The LOH always lives in its own segments – large objects are treated differently from small objects thus they don’t share segments with small objects.

 

Allocation

 

When you allocate on the GC heap, exactly what does it cost? If we don’t need to do a GC, allocation is 1) moving a pointer forward and 2) clearing the memory for the new object. For finalizable objects there’s an extra step of entering them into a list that GC needs to watch.

 

Notice I said “if we don’t need to do a GC” – this means the allocation cost is proportional to the allocation volume. The less you allocate, the less work the GC needs to do. If you need 15 bytes, ask for 15 bytes; don’t round it up to 32 bytes or some other bigger chunk like you used to do when you used malloc. There’s a threshold that, when exceeded, triggers a GC – you want to trigger that as infrequently as you can.
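For example (a trivial sketch):

// Ask for exactly what you need – the GC doesn't reward the
// rounded-up sizes that some malloc implementations do.
byte[] record = new byte[15];
// not: byte[] record = new byte[32];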

 

Another property of the GC heap that distinguishes it from the NT heap is that objects allocated together stay together on the GC heap, which preserves locality.

 

Each object allocated on the GC heap has an 8-byte overhead (sync block + method table pointer).

 

As I mentioned, large objects are treated differently so for large objects generally you want to allocate them in a different pattern. I will talk about this in the large object section.

 

Collection

 

First things first – when exactly does a collection happen (in other words, when is a GC triggered)? A GC occurs when one of the following 3 conditions holds:

 

1)      Allocation exceeds the Gen0 threshold;

2)      System.GC.Collect is called;

3)      The system is in a low memory situation.

1) is your typical case. When you allocate enough, you will trigger a GC. Allocations only happen in Gen0. After each GC, Gen0 is empty. New allocations will fill up Gen0 and the next GC will happen, and so on.

 

You can avoid 2) by not calling GC.Collect yourself – if you are writing an app, you should usually never call it. The BCL is basically the only place that should call it (and only in very limited places). The problem with calling it in your app is that when it’s called more often than you predicted (which can easily happen), performance goes down the drain because GCs are triggered ahead of the schedule the GC has tuned for best performance.

 

3) is affected by other processes on the system, so you can’t exactly control it except by doing your part to be a good citizen in your own processes/components.

 

Let’s talk about what this all means to you. First of all, the GC heap is part of your working set, and it consumes private pages. In the ideal situation, objects that get allocated always die in Gen0 (meaning almost everything gets collected by a Gen0 collection and a full collection never happens), so your GC heap never grows beyond the Gen0 size. In reality, of course, that’s almost never the case. So you really want to keep your GC heap size under control.

 

Secondly, you want to keep the time you spend in GC under control. This means 1) fewer GCs and 2) fewer high generation GCs. Collecting a higher generation is more expensive than collecting a lower generation because collecting a higher generation includes collecting the objects that live in that generation and in the lower generation(s). Ideally you allocate either very temporary objects that die really quickly (mostly in Gen0, and Gen0 collections are cheap) or really long lived objects that stay in Gen2. For the latter case, the usual scenario is objects you allocate up front when the program starts – for example, in an ordering system, you allocate memory for the whole catalog and it only dies when your app terminates.

 

CLRProfiler is an awesome tool for looking at your GC heap to see what’s in there and what’s holding objects alive.

 

How to organize your data

 

1) Value type vs. Reference type

 

As you know, value types are allocated on the stack (or inline in their containing object), unlike reference types which are allocated on the GC heap. So people ask how to decide when to use value types and when to use reference types. Well, with performance the answer is usually “It depends” and this one is no different (did you actually expect something else?). Value types don’t trigger GCs, but if your value type is boxed often, the boxing operation is more expensive than creating an instance of a reference type to begin with; and when value types are passed as parameters they need to be copied. On the other hand, if you have a small member, making it a reference type incurs a pointer-size overhead (plus the overhead of the reference type itself). We’ve seen some internal code where making a member inline (ie, a value type) improved perf because it decreased the working set. So it really depends on your types’ usage pattern.
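To make the boxing cost concrete, here’s a small sketch (the types are illustrative; ArrayList stores System.Object, so adding a value type to it boxes):

using System.Collections;

struct PointValue { public int X; public int Y; }

class BoxingExample
{
    static void Main()
    {
        PointValue p;
        p.X = 1; p.Y = 2;

        object boxed = p;                 // boxing: a new heap object plus a copy
        ArrayList list = new ArrayList();
        list.Add(p);                      // boxes again on every Add

        PointValue q = (PointValue)boxed; // unboxing: another copy
    }
}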

 

2) Reference rich objects

 

If an object is reference rich, it puts pressure on both allocation and collection. Each embedded object incurs the 8-byte overhead, and since allocation cost is proportional to allocation volume, the allocation cost is now higher. When collecting, it also takes more time to traverse the object graph.

 

As far as this goes, I would just recommend that normally you organize your classes according to their logical design. You don’t want to hold other objects alive when you don’t have to – for example, you don’t want to store references to young objects in old objects if you can avoid it.

 

3) Finalizable objects

 

I will cover more details about finalization in its own section, but for now, one of the most important things to keep in mind is that when a finalizable object gets finalized, all the objects it holds on to need to be kept alive, and this drives the cost of GC higher. So you want to isolate your finalizable objects from other objects as much as you can.
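A hedged sketch of what that isolation can look like (the names are illustrative; CloseHandle is the real Win32 API, declared via P/Invoke):

using System;
using System.Runtime.InteropServices;

class NativeMethods
{
    [DllImport("kernel32.dll")]
    internal static extern bool CloseHandle(IntPtr handle);
}

// Worse: because this class has a finalizer, the big buffer it
// references must be kept alive until the finalizer has run.
class ChattyWrapper
{
    IntPtr _handle;
    byte[] _bigBuffer = new byte[100000];

    ~ChattyWrapper()
    {
        NativeMethods.CloseHandle(_handle);
    }
}

// Better: only a tiny wrapper is finalizable; the big buffer can
// die in an ordinary Gen0/Gen1 collection.
class HandleWrapper
{
    internal IntPtr _handle;

    ~HandleWrapper()
    {
        NativeMethods.CloseHandle(_handle);
    }
}

class IsolatedWrapper
{
    HandleWrapper _handle = new HandleWrapper();
    byte[] _bigBuffer = new byte[100000];
}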

 

4) Object locality

 

When you allocate the children of an object, if the children need to have a similar lifetime to their parent, allocate them at the same time so they stay together on the GC heap.
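For example (a trivial sketch):

class OrderLine
{
    // ...
}

class Order
{
    OrderLine[] _lines;

    public Order(int lineCount)
    {
        // Allocating the children right along with the parent keeps
        // them adjacent on the GC heap, so they move (or die) together.
        _lines = new OrderLine[lineCount];
        for (int i = 0; i < lineCount; i++)
        {
            _lines[i] = new OrderLine();
        }
    }
}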

 

Large Objects

 

When you ask for an object that’s 85000 bytes or more it will be allocated on the LOH. LOH segments are never compacted – only swept (using a free list). But this is an implementation detail that you should NOT depend on – if you allocate a large object that you expect to not move, you should make sure to pin it.

 

Large objects are only collected in full collections, so they are expensive to collect. Sometimes you see that after a full collection the Gen2 heap size doesn’t change much. That could mean the collection was triggered for the LOH (you can judge by looking at the decrease in the LOH size reported by perfmon).

 

A good practice with large objects is to allocate one and keep reusing it so you don’t incur more full GCs. If, say, you want a large object that can hold either 100k or 120k, allocate one that’s 120k and reuse it. Allocating many very temporary large objects is a very bad idea ‘cause you’ll be doing full collections all the time.
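A minimal sketch of that reuse pattern (the sizes and names are illustrative):

class LargeBufferHolder
{
    // 120k covers both the 100k and the 120k cases. Allocated once
    // on the LOH and reused, instead of allocating a temporary large
    // object per operation (which would keep triggering full GCs).
    static byte[] s_buffer = new byte[120 * 1024];

    internal static byte[] Buffer
    {
        get { return s_buffer; }
    }
}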

 

 

That’s all for Part 1. In future entries I’ll cover things like pinning, finalization, GCHandles, Svr GC, etc. If you have questions about the topics I covered in this entry or would like more info on them, feel free to post them.


Using GC Efficiently – Part 2

In this article I’ll talk about the different flavors of GC, the design goals behind each of them and how they work differently from each other, so you can make a good decision about which flavor of GC to choose for your applications.

 

Existing GC flavors in the runtime today

 

We have the following flavors of GC today:

 

1)      Workstation GC with Concurrent GC off

2)      Workstation GC with Concurrent GC on

3)      Server GC

 

If you are writing a standalone managed application and don’t do any config on your own, you get 2) by default. This might come as a surprise to a lot of people because our documentation doesn’t exactly mention much about concurrent GC and sometimes refers to it as the “background GC” (while referring to Workstation GC as the “foreground GC”).

 

If your application is hosted, the host might change the GC flavor for you.

 

One thing worth mentioning is if you ask for Server GC and the application is running on a UP machine, you will actually get 1) because Workstation GC is optimized for high throughput on UP machines.

 

Design goals

 

Flavor 1) is designed to achieve high throughput on UP machines. We use dynamic tuning in our GC to observe the allocation and survival patterns, and adjust the tuning as the program runs to make each GC as productive as possible.

 

Flavor 2) is designed for interactive applications where response time is critical. Concurrent GC allows for shorter pause time, and it trades some memory and CPU to achieve this goal – you get a slightly bigger working set and slightly longer collection time.

 

Flavor 3), as the name suggests, is designed for server applications where the typical scenario is a pool of worker threads that are all doing similar things – for example, handling the same type of requests or doing the same type of transactions. All these threads tend to have the same allocation patterns. The server GC is designed for high throughput and high scalability on multiproc machines.

 

How they work

 

Let’s start with Workstation GC with Concurrent GC off. The flow goes like this:

 

1)      A managed thread is doing allocations;

2)      It runs out of allocations (and I’ll explain what this means);

3)      It triggers a GC which will be running on this very thread;

4)      GC calls SuspendEE to suspend managed threads;

5)      GC does its work;

6)      GC calls RestartEE to restart the managed threads;

7)      Managed threads start running again.

 

If you break into the debugger during step 5), you will see that all managed threads are stopped, waiting for the GC to complete. SuspendEE isn’t concerned with native threads, so for example if a thread is calling out to some Win32 API, it’ll keep running while the GC is doing its work.

 

For the generations we have a concept called a “budget”. Each generation has its own budget, which is dynamically adjusted. Since we always allocate in Gen0, you can think of the Gen0 budget as an allocation limit that, when exceeded, triggers a GC. This budget is completely different from the GC heap segment size and is a lot smaller than the segment size.

 

The CLR GC can do either compacting or non compacting collections. Non compacting, also called sweeping, is cheaper than compacting, which involves copying memory – an expensive operation.

 

When Concurrent GC is on, the biggest difference is with Suspend/Restart. As I mentioned, Concurrent GC allows for shorter pause time because the application needs to be responsive. So instead of not letting the managed threads run for the duration of the GC, Concurrent GC only suspends these threads for a few very short periods during the GC. The rest of the time, the managed threads are running and allocating if they need to. We start with a much bigger Gen0 budget in the Concurrent GC case so there’s enough room for the application to allocate while the GC is running. However, if during the Concurrent GC the application has already used up the Gen0 budget, the allocating threads will be blocked (to wait for the GC to finish) if they need to allocate more.

 

Note that since Gen0 and Gen1 collections are very fast, it doesn’t make sense to do them concurrently. So we only consider doing a Concurrent GC if we need to do a Gen2 collection – if we decide to do a Gen2 collection, we then decide whether that collection should be concurrent or non concurrent.

 

Server GC is a totally different story. We create a GC thread and a separate heap for each CPU. GC happens on these threads instead of on the allocating thread. The flow looks like this:

 

1)      A managed thread is doing allocations;

2)      It runs out of allocations on the heap it’s allocating on;

3)      It signals an event to wake the GC threads to do a GC and waits for it to finish;

4)      GC threads run, finish with the GC and signal an event that says GC is complete (When GC is in progress, all managed threads are suspended just like in Workstation GC);

5)      Managed threads start running again.

 

Configuration

 

To turn Concurrent GC off, use

 

<configuration>

    <runtime>

        <gcConcurrent enabled="false"/>

    </runtime>

</configuration>

 

in your application config file.

 

To turn Server GC on, use

 

<configuration>

    <runtime>

        <gcServer enabled="true"/>

    </runtime>

</configuration>

 

in your application config file, if you are using Everett SP1 or Whidbey. Before Everett SP1 the only supported way was via the hosting APIs (look at CorBindToRuntimeEx).


Using GC Efficiently – Part 3

In this article I’ll talk about pinning and weak references – stuff related to GC handles.

 

(I was planning on talking about finalization in this part of the “Using GC Efficiently” series but since I already covered it in pretty much detail in one of my previous blog entries I won’t repeat it here. Feel free to ask if you have questions not answered by that entry.)

 

Pinning

 

In a way pinning is like finalization – both exist because we have to deal with native code.

 

When do objects get pinned? In 3 situations:

 

1)      Using a GCHandle of type GCHandleType.Pinned;

2)      Using the “fixed” keyword in C# (other languages might have similar constructs – see the sketch after this list);

3)      During Interop, certain types of arguments get pinned by Interop (for example, to marshal LPWSTR as a String object, Interop pins the buffer for the duration of the call).
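As a small illustration of 1) and 2) (a hedged sketch – the native call is hypothetical, and GCHandle lives in System.Runtime.InteropServices):

static unsafe void PinningExamples()
{
    byte[] buffer = new byte[256];

    // 2) "fixed" pins only for the duration of the block
    //    (requires compiling with /unsafe).
    fixed (byte* p = buffer)
    {
        NativeApi.Fill(p, buffer.Length);   // hypothetical native call
    }   // buffer is unpinned here

    // 1) A pinned GCHandle pins until you explicitly free it.
    GCHandle h = GCHandle.Alloc(buffer, GCHandleType.Pinned);
    try
    {
        NativeApi.Fill((byte*)h.AddrOfPinnedObject(), buffer.Length);
    }
    finally
    {
        h.Free();   // forget this and the buffer stays pinned forever
    }
}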

 

For the small object heap, pinning is the only user scenario that can create fragmentation (by “user scenario” I mean fragmentation created not by the runtime/GC itself but by user code).

 

For the large object heap, pinning is a no-op since the large object heap is never compacted. But you should still pin an object if you want it to be pinned – as I mentioned in my previous entry, the LOH always being swept is an implementation detail.

 

Creating fragmentation is never a good thing. It makes the GC work a lot harder – instead of simply “squeezing” live objects together, it now has to keep records of which live objects are pinned and try to fit objects into the free spaces between pinned objects. With each release we are doing more work to mitigate the issues created by fragmented heaps.

 

So how do you determine how much fragmentation you have in your application? You can use the !dumpheap command in the SOS debugger extension and look for Free objects – “!dumpheap -type Free -stat” will give you a summary of all Free objects. Generally if there’s 10% or less fragmentation in the heap I wouldn’t worry about it. So when you have a big heap, don’t panic if the absolute number of bytes in Free objects is high, as long as it’s less than 10% of the heap. Looking at the objects after the Free objects can give you a clue as to who is pinning.

 

When you do need to pin, here are some things to keep in mind:

 

1)      Pinning for a short time is cheap.

 

How short is “a short time”? Well, if there’s no GC happening, pinning simply sets a bit in the object header and unpinning simply clears it. But when a GC happens, we have to make sure not to move the pinned objects. So “pinning for a short time” means pinning such that a GC never notices the object was pinned – in other words, between pinning an object and unpinning it, there’s not much, if any, allocation going on.

 

2)      Pinning an older object is not as harmful as pinning a young object.

 

By “an older object” I mean an object that has had the chance to migrate to Gen2 and be compacted into a relatively stable location.

 

3)      Create pinned buffers that stay together instead of scattered around – this way you create fewer holes.

 

A couple of examples of good techniques:

 

1)      Allocate a pinned buffer in the LOH and give out a chunk at a time.

 

The downside is “chunks” are not objects, and there are very few APIs that accept non-objects.

 

2)      Allocate a pool of small object buffers and hand them out when needed.

 

For example, I have a pool of buffers, and method M takes a byte array which needs to be pinned. If the buffer is already in Gen2 it’s OK to pin it – the idea is that hopefully your method doesn’t need to use the buffer for long, so the buffer will be freed in a Gen0 or Gen1 collection. When the buffer is not a Gen2 buffer, you get a buffer from your buffer pool to use instead – by now the buffers in the buffer pool are most likely all in Gen2 anyway:

 

void M(byte[] b)
{
    if (GC.GetGeneration(b) == GC.MaxGeneration)
    {
        // Already in Gen2 – pinning it is relatively harmless.
        RealM(b);
        return;
    }

    // GetBuffer will allocate one if no buffers
    // are available in the buffer pool.
    byte[] tempBuffer = BufferPool.GetBuffer();
    RealM(tempBuffer);
    CopyBackToUserBuffer(tempBuffer, b);
    BufferPool.Release(tempBuffer);
}

 

Weak References

 

How are weak references implemented?

 

A weak reference has a managed part and a native part. The managed part is the WeakReference class itself. In its constructor we ask to create a GC handle, which is the native part – it inserts an entry in that AppDomain’s handle table (note that all GCHandles are created this way – they are just inserted as their respective types). The object that the weak reference refers to will die when there are no strong references pointing to it. Since the weak reference is a managed object itself, it will be freed like any other managed object.
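A small sketch of doing the same thing with the handle directly (GCHandle lives in System.Runtime.InteropServices; WeakReference is essentially a convenience wrapper over a weak handle like this):

object obj = new object();
GCHandle h = GCHandle.Alloc(obj, GCHandleType.Weak);   // an entry in the handle table

// ... later ...
object target = h.Target;   // null once obj has been collected
if (target == null)
{
    // Unlike WeakReference, a raw GCHandle must be freed explicitly.
    h.Free();
}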

 

This means if you have a very small object, let’s say one with a single DWORD field, the object will be 12 bytes (the size of a minimal object). A WeakReference has an IntPtr and a bool field, plus the GC handle which is a pointer in size – so you are paying more than the object’s own size to refer to it with a weak reference. Obviously you don’t want to get into a situation where you are creating many weak references to refer to small objects.

 

What are the uses of weak references?

 

1)      Watching an object temporarily, or cleaning up when an object gets collected.

 

Why would you use weak references to watch objects for cleanup instead of using a finalizer? The advantage is that the object being watched isn’t promoted the way it would be if it had a finalizer; the disadvantage is that it’s more expensive in terms of memory consumption, and the cleanup happens only when user code checks whether the object the weak reference points to is null.

 

Option A):

 

class A
{
    WeakReference _target;

    MyObject Target
    {
        set
        {
            _target = new WeakReference(value);
        }

        get
        {
            MyObject o = (MyObject)_target.Target;
            if (o != null)
            {
                return o;
            }
            else
            {
                // my target has been GC'd - clean up
                Cleanup();
                return null;
            }
        }
    }

    void M()
    {
        // target needs to be alive throughout this method.
        MyObject target = Target;

        if (target == null)
        {
            // target has been GC'd, don't bother
            return;
        }
        else
        {
            DoSomeWork();
        }

        // Keeps target strongly referenced until this point, so it
        // can't be collected in the middle of the method.
        GC.KeepAlive(target);
    }
}

 

Option B):

 

class A
{
    WeakReference _target;
    MyObject ShortTemp;   // a strong reference that keeps the target alive while set

    MyObject Target
    {
        set
        {
            _target = new WeakReference(value);
        }

        get
        {
            MyObject o = (MyObject)_target.Target;
            if (o != null)
            {
                return o;
            }
            else
            {
                // my target has been GC'd - clean up
                Cleanup();
                return null;
            }
        }
    }

    void M()
    {
        // target needs to be alive throughout this method.
        MyObject target = Target;
        ShortTemp = target;

        if (target == null)
        {
            // target has been GC'd, don't bother
            return;
        }
        else
        {
            // could assert that ShortTemp is not null.
            DoSomeWork();
        }

        ShortTemp = null;
    }
}

 

2)      Maintaining a cache.

 

You can have an array of weak references:

 

WeakReference[] WeakRefs;

 

where each item in the array references an object via a weak reference. Periodically we can go through this array, see which objects are dead, and release the weak references for those objects.

 

If we always get rid of the weak references when the objects they refer to are dead, the cache will be invalidated on each GC.
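A minimal sketch of that periodic scan (the array size is illustrative):

WeakReference[] WeakRefs = new WeakReference[256];

void Scavenge()
{
    for (int i = 0; i < WeakRefs.Length; i++)
    {
        if (WeakRefs[i] != null && WeakRefs[i].Target == null)
        {
            // The cached object has died – release our entry so the
            // WeakReference itself (and its handle) can be collected.
            WeakRefs[i] = null;
        }
    }
}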

 

If that’s not sufficient for you, you can use a 2-level caching mechanism:

 

·          Maintain a strong reference for the cached items for x amount of time;

 

·          After x amount of time is up, convert the strong references to weak references. Items held by weak references will be considered for eviction from the cache before items held by strong references.

 

You can have different policies for the cache, such as basing it on how often the cached items are queried – the ones that are queried less often are held by (or converted to) weak references; or basing it on the number of items in the cache – maintain weak references to the overflow items when the cache grows beyond a certain size. It all depends on your cache usage. Tuning caches is a whole other topic on its own – perhaps some day I will write about it.
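A hedged sketch of the two-level idea (the names and the time-based policy are illustrative):

class CacheEntry
{
    object _strongRef;            // non-null while the item is "fresh"
    WeakReference _weakRef;
    DateTime _lastTouched;

    public CacheEntry(object item)
    {
        _strongRef = item;
        _weakRef = new WeakReference(item);
        _lastTouched = DateTime.Now;
    }

    // Call periodically: demote stale entries to weak references so
    // the GC is free to collect them.
    public void Age(TimeSpan maxFresh)
    {
        if (_strongRef != null && DateTime.Now - _lastTouched > maxFresh)
        {
            _strongRef = null;    // only the weak reference remains
        }
    }

    public object Get()
    {
        object item = (_strongRef != null) ? _strongRef : _weakRef.Target;
        if (item != null)
        {
            _strongRef = item;            // promote back to strong on a hit
            _lastTouched = DateTime.Now;
        }
        return item;                      // null means the item was collected
    }
}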

