Thursday, November 25, 2010

Optimizing Performance On The Xbox 360 - Part 2

In part 1 I described the reasons why Xbox 360 is slower than an equivalent PC and  some of the basic recommendations for optimizing C# code to make it run faster. In this post I will mention a few other tricks to increase performance on the Xbox 360.

Structures… again.

In the .NET world, objects are allocated on the heap and structures are allocated on the stack. Remember, only the heap is garbage collected, this means that structures are not subject to the expensive garbage collection operation. Allocation on the stack is basically free as we only need to increase a stack pointer and call a constructor. This means you can reduce garbage collection and increase speed by using structures instead of objects – however, you need to know the difference between the two. You should also follow the general recommendations for using structures:

  • Small data objects (data containers)
  • Often destroyed or created
  • Read-only (or mostly-read) objects

Reuse those objects

As mentioned before, objects are allocated on the heap that is garbage collected. That is why you should avoid creating a lot of objects on each update. Instead, you should reuse the objects using pools as explained in part 1. Another and simpler way is to move the object to outer scope and/or make it static. One object that you should pay close attention to is the SpriteBatch (if you are using XNA). Do not create the SpriteBatch object on each update, but keep a reference to it through out the game.

You should also pay attention to List<T>, arrays and other commonly used list types. When you create a new List<T> and add a single item to it, it will initialize an array with a size of 4. If you add more than 4 elements, it will create a new array with double the size. The results is that 2 arrays have been garbage collected. To keep that from happening, you should initialize List<T> with a capacity that is equal to what you need. To keep the List<T> form being garbage collected, you should move the list to outer scope and clear it before use instead of initializing a new instance.

Boxing

Boxing is the process of encapsulating a value type in a reference type container. This is great as it allows us to pass value types as objects to methods or store them in collections as objects. However, this is not great for performance and we have to eliminate boxing and unboxing where we can. You can use tools like Reflector or Ildasm to disassemble your application to MSIL code. Inside the MSIL code you simply look for the box and unbox keywords. You can find more info along with pictures in my post Performance Tips and Tricks Part 4.

Further reading

I recommend you read two GDC 2008 PowerPoint presentations. CLR Performance by Frank Savage and Understanding XNA Framework Performance by Shawn Hargreaves.

Sunday, November 21, 2010

Optimizing Performance On The Xbox 360 - Part 1

After my Performance Tips and Tricks series I got a lot of requests to write a post regarding the performance on the Xbox 360 platform. I will try to focus on why Xbox 360 is slower than PC and what we can do to get around it. But first we need to underline the importance of good programming.

Algorithms and datastructure complexity

I’ve been on the topic of complexity before. I can’t underline how important it is that we keep the complexity of our algorithms low. If you are sorting a large dataset, make sure you use the correct algorithm and if you are sorting a small dataset (<10 items), there is another and faster algorithm.

Note: If you use .Net List<T>.Sort(), it automatically chooses the best algorithm depending on your dataset.

If you are using pixel perfect collision detection, perhaps you should implement a physics system that uses a polygon collision system instead. You could also implement a broad phase system to filter out collisions that you don’t need to process. Implementing a fast and efficient algorithm is 100 times better than any micro-optimization or performance trick out there. But this post is not about using efficient algorithms, it is about the case where you already have implemented the algorithms and still need to squeeze the last drops of performance out of the platform.

Why is it slow?

Xbox 360 uses a custom triple-core 64bit PowerPC 3.2 GHz CPU. It is designed by IBM with high floating point performance in mind by using multiple FPU and SIMD vector processing units in each core. It is a quite powerful platform, but an equivalent PC still outperforms it, how can that be?

I’m afraid that the hardware is not the problem here, it is the software on the Xbox 360. Let us take an example; the following piece of code runs in 12500 ticks on Xbox 360 and 10 ticks on PC.

code1

That is a huge difference and even if we switch to integers instead of floats (to get around the FPU) we still get around 7060 ticks on Xbox 360 and 10 ticks on PC. The low performance when using floats is due to missing AltiVec support in .Net Compact Framework and some missing optimizations in the floating point code. The .Net Compact Framework was not designed to run on Xbox 360. In fact, it was designed with portability in mind and did not have any floating point code until the Xbox was released. Thankfully the .Net CF team implemented some FPU code and that improved the floating point performance a lot (by a factor of 10!).

Then there is the garbage collector. The CF has a simplified version of the garbage collector that works without generations. Instead it looks up all the objects after 1 MB has been allocated. There are also a few other differences such as code pitching where the GC frees code it has jitted before to free up memory. And last but not least there are the compiler optimizations performed by the CF JIT compiler. The optimizations carried out by the CF JIT is severely limited, especially the method inlining. The CF JIT will only inline a method based on the following criteria:

  • 16 bytes or less of IL
  • No branching (if statements)
  • No local variables
  • No exception handlers
  • No 32-bit floating point arguments or return value
  • If it has more than one argument, they must be accessed in order from lowest to highest.

That means it is basically only inline property getter/setters and methods that call other methods.

The missing FPU code, the few optimizations and the simple garbage collector is all because the CF never was designed to work on anything else but embedded devices such as mobile phones. Lets take a look at what options we have…

Optimizing the code

It is very important that you start by profiling your code. You can use the XNA Remote Performance Monitor to profile memory allocations (garbage collector problems) and you can use the Stopwatch class to profile different areas of your code to see which area you should focus on. A normal PC profiler also comes in handy as the areas that tend to take up the most time on PC, almost always are the same areas on the Xbox 360.

gc

When you have found the different areas to focus on, we can move on to actually optimizing them.

Virtual methods
One of the places to optimize are the virtual methods and properties. While virtuals are  great for OOP, they are not so great for the CF JIT compiler. Virtual methods are around 40% slower than static or instance calls and just to make it all worse, they are never inlined by the compiler.

virtual1

To get around it you need to avoid virtuals where you can. If you need to use virtuals, make sure you mark your overridden methods and classes as sealed. If you use the sealed keyword, the compiler can sometimes resolve the destination of a virtual call and avoid the expensive virtual call.

Use fields instead of properties
While properties are a neat way to keep your code clean, they are not that great performance wise. Under the hood, properties are ordinary methods and while simple get/set properties gets inlined by the compiler, there are no guarantees that all of them will get inlined. The solutions is to use fields instead, but not at the costs of a poorly designed application. So use it wisely!

field

The ref keyword
If you are working with structs, you should try and use the ref keyword to send them by reference instead of by value. If you are working with the XNA library, the Matrix, Vector2 and Vector3 classes have math methods that uses ref and out to pass structs by reference instead of by value.

ref

Inlining
Manually inlining code when using the Compact Framework can give you an extra performance boost. The gain is typically around 5-15% depending on your code. The tricky part of inlining is to determine what code to inline. I recommend you follow some simple rules:

  • Small methods
  • Methods that are only called from one place
  • Methods that does not alter the state of the object (only give input and get output methods)
  • Methods that does not already get inlined by the compiler

It is important to run a profiler while you inline. You might get even worse performance by inlining the wrong methods. It is important to measure it!

inline

Pooling
In the CLR, creating an object is a really efficient operation while destroying the object is not. This is due to the fact that destroying an object will make it subject to the garbage collector. A way to get around this is to use pooling of objects. You basically create a pool (you could use a stack) and then pre-allocate a bunch of empty objects before use. When you need an object, you fetch one from the pool and when you are done with it, you put it back in the pool. This way you keep the object reference alive and it will get collected by the garbage collector.

Fixed point math
I’ve been on this subject before. If you have some heavy floating point operations that you need to run on a platform with low floating point performance, then you should implement fixed point math. A trick is to identify operations that don’t need floating point accuracy (like my code above in the explanation) and convert it to use integers instead.

Update: Part 2 is here.

Friday, November 12, 2010

Farseer Physics Engine 3.1 released

I’ve just packaged and released Farseer Physics Engine 3.1. It has a lot of new features and a couple of bugs have been fixed. You can see a list of changes in the new version on the downloads page.

Sample