Engine optimizations, part 2
Reducing the number of input/output operations is one of many optimisations that can make a program run more efficiently. Optimising input and output minimises the time the CPU spends waiting for data. Data access is only instantaneous when the data is held in a CPU register; access from anywhere else involves bus cycles and carries some latency. Depending on the storage medium, that latency can range from nanoseconds to milliseconds, but it is always considerably longer than a single CPU clock cycle. The delay can cause the CPU pipeline to stall while it waits for data.
The Sega 32X’s access time in relation to storage type/peripheral looks like this:

[Chart: access times by storage type, from CPU registers and cache through SDRAM to cartridge ROM]
The chart above highlights how hardware registers and CPU cache are the most optimal types of data storage, offering near-equal access times. The 32X has only a handful of registers, along with 4KiB of cache within each SH-2 CPU. These hardware resources are small and not suitable for storing large amounts of data, such as textures, models and sprites, for long periods of time.
The best option for storing data of this type is SDRAM (or simply RAM), which offers only slightly higher access times. The 32X has just 256KiB of SDRAM accessible to its CPUs.
In addition to SDRAM, the game cartridge itself can provide up to 4MiB of ROM, which can be increased further with bank switching. The 32X cannot directly access the Mega Drive's hardware, such as its RAM and VRAM, but the Mega Drive's CPU can pass data between the two.
In summary, the 32X has a small amount of fast RAM (SDRAM) to store a program and a large amount of slow ROM. To overcome the slow ROM, tactical decisions must be made about which resources to cache in SDRAM.
There is another caveat. The SDRAM of the 32X is hardwired for burst reads: each time a byte or word is fetched from the RAM address space, the whole cache line, a batch of 8 words (16 bytes), is fetched. This has a number of implications:
- burst reads are atomic/uninterruptible
- a cache miss results in a 12 bus cycle penalty
- linear reads are fast and take 1.5 cycles per word on average
- random reads that stretch beyond cache line boundaries are slow
- cache-through reads from SDRAM are slow: the hardware always reads all 8 words from the cache line, tossing away the unneeded words
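To make these numbers concrete, here is a toy cost model (illustrative C, not engine or hardware code) that charges a full 12-cycle burst for every new 16-byte line an access touches:

```c
#include <stdint.h>

#define LINE_BYTES 16u  /* one SDRAM burst: 8 words of 2 bytes */
#define LINE_COST  12u  /* bus cycles per burst */

/* Toy model: each access that lands on a new 16-byte line pays the
 * full 12-cycle burst; further accesses within the current line are
 * free. Returns total bus cycles for n reads starting at 'start'
 * with the given byte stride. */
static uint32_t read_cost(uint32_t start, uint32_t stride, int n)
{
    uint32_t cycles = 0, curline = 0xFFFFFFFFu;
    for (int i = 0; i < n; i++) {
        uint32_t line = (start + (uint32_t)i * stride) / LINE_BYTES;
        if (line != curline) {
            cycles += LINE_COST;
            curline = line;
        }
    }
    return cycles;
}
```

By this model, reading 8 consecutive words costs a single 12-cycle burst, i.e. 1.5 cycles per word on average, while 8 reads scattered across different lines cost 96 cycles.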
These issues may seem like a hindrance to using the SDRAM for texture caching in Doom, but it is actually the second most important optimization found in DOOM 32X Resurrection, and here's why.
As mentioned previously, the SDRAM favours sequential reads in small increments of 8 words (16 bytes). If the CPU reads SDRAM from scattered memory locations instead of sequentially, performance is worse than reading from cartridge ROM, the slowest storage available on the 32X. When designing the SDRAM texture cache this had to be taken into account to ensure the draw routines avoid making random jumps across texture data.
So how was this achieved? Luckily, it was pretty easy in the case of wall textures. Walls in DOOM cannot be sloped and are always perpendicular to floors and ceilings, meaning the gradient vector used to calculate the position of the next pixel drawn has only one coordinate. So instead of doing this for each pixel:
u = u + ufrac;
v = v + vfrac;
c = v * width + u;
pixel = texture[c & mask];
We can do this instead:
u = u + ufrac;
c = u;
pixel = texture[c & mask];
This simply translates to linear reads from SDRAM, which is exactly what we want. All we need to do is store our wall textures in column-major order (i.e. rotated 90 degrees to the left). Luckily for us, the DOOM programmers already took care of that!
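For illustration, converting a row-major texture into column-major order can be sketched like this (a hypothetical helper, not the engine's code; DOOM's wall patches already come stored column-first):

```c
#include <stdint.h>

/* Convert a row-major texture into column-major order so that
 * drawing a vertical strip of pixels becomes a linear read.
 * dst and src are both w*h bytes: dst[x*h + y] = src[y*w + x]. */
static void to_column_major(uint8_t *dst, const uint8_t *src, int w, int h)
{
    for (int x = 0; x < w; x++)
        for (int y = 0; y < h; y++)
            dst[x * h + y] = src[y * w + x];
}
```

After the conversion, walking down a column of the texture is a linear walk through memory, exactly the access pattern the SDRAM burst reads favour.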
Visplanes, which represent floors and ceilings in DOOM, are not rendered as columns, so the method used to optimise reads for wall textures isn't as effective here. On the plus side, floor and ceiling textures (or simply "flats") have a fixed size of 64x64 pixels (4KiB), the exact same size as the cache found in each of the 32X's CPUs. Even if a flat is not in the CPU cache prior to the first draw, we can still make use of the cache if we're about to draw the same flat again. Floors and ceilings in DOOM often use repeating tiles of the same flat, and draw order doesn't matter the way it does for walls. All we need to do is sort our visplanes by their corresponding flat texture ID before drawing them, which increases the odds of the CPU encountering the same flat on subsequent visplane draws.
int i, numplanes;
visplane_t* pl;
uint16_t *sortbuf = sortedvisplanes;

// encode (flat ID, visplane number) pairs as 32-bit integers
i = 0;
numplanes = 0;
for (pl = visplanes + 1; pl < lastvisplane; pl++)
{
    sortbuf[i + 0] = pl->flatnum;
    sortbuf[i + 1] = pl - visplanes;
    i += 2;
    numplanes++;
}

// call the insertion sort function
D_isort((int*)sortbuf, numplanes);
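The cast to int* works because on the big-endian SH-2 the flat ID written first lands in the high 16 bits of each 32-bit value, so an ordinary sort over the combined integers groups visplanes by flat. A minimal insertion sort along those lines (a sketch; the real D_isort lives in the d32xr source) could look like:

```c
/* Insertion sort over 32-bit keys. Each key holds the flat ID in
 * its high 16 bits and the visplane index in the low 16, so sorting
 * the combined value brings visplanes sharing a flat next to each
 * other. */
static void isort(int *a, int n)
{
    for (int i = 1; i < n; i++) {
        int v = a[i], j = i - 1;
        while (j >= 0 && a[j] > v) {
            a[j + 1] = a[j];    /* shift larger keys right */
            j--;
        }
        a[j + 1] = v;
    }
}
```

Insertion sort is a sensible choice here: the number of visplanes per frame is small, and the list is often nearly sorted between consecutive frames.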
The 32X only has 256KiB of SDRAM at its disposal. This RAM holds the program's state, and whatever is left after initialisation is used as a texture cache.
D32XR, like the original DOOM, uses a custom zone memory allocator. The engine has a dedicated memory pool of 200KiB allocated for this purpose:
static char zone[0x32000] __attribute__((aligned(16)));
After loading a level, the engine checks that there is enough unused memory to hold at least one flat texture plus its bookkeeping structures. It also ensures there is a margin of at least 12KiB of free memory for things such as game objects and projectiles.
const int zonemargin = 12*1024;
const int flatblocksize = sizeof(memblock_t) + ((sizeof(texcacheblock_t) + 15) & ~15) + 64*64 + 32;
zonefree = Z_LargestFreeBlock(mainzone);
if (zonefree < zonemargin+flatblocksize)
goto nocache;
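The `(sizeof(texcacheblock_t) + 15) & ~15` expression in the snippet above is the usual round-up-to-a-multiple-of-16 idiom, matching the 16-byte alignment of the zone. In isolation:

```c
#include <stddef.h>

/* Round a size up to the next multiple of 16 bytes, the same
 * (x + 15) & ~15 idiom used in the flatblocksize computation:
 * adding 15 carries any partial line over, masking the low 4 bits
 * snaps the result back down to a 16-byte boundary. */
static size_t align16(size_t x)
{
    return (x + 15) & ~(size_t)15;
}
```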
Due to the limited amount of memory, we need to make sure that only the “important” textures are cached. D32XR does this by computing the total number of screen pixels each unique texture occupies in the total image. A simple lookup table is used to store the results.
This metric determines the order in which textures are cached. The engine also tracks which textures have already been cached, to avoid flooding the cache with identical texture data.
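A minimal sketch of such a lookup table (the names and the texture-count bound are illustrative, not the engine's):

```c
#include <stdint.h>

#define MAXTEXTURES 256  /* illustrative bound, not the engine's */

/* Per-frame pixel counts, indexed by texture ID. The draw code adds
 * the number of screen pixels each column or span covers; the cache
 * then picks the texture with the highest count. */
static uint32_t texpixels[MAXTEXTURES];

static void count_pixels(int texid, uint32_t npixels)
{
    texpixels[texid] += npixels;
}

/* Return the ID of the texture covering the most pixels this frame,
 * or -1 if nothing was counted. */
static int best_texture(void)
{
    int best = -1;
    uint32_t bestcount = 0;
    for (int i = 0; i < MAXTEXTURES; i++) {
        if (texpixels[i] > bestcount) {
            bestcount = texpixels[i];
            best = i;
        }
    }
    return best;
}
```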
D32XR also avoids caching textures that occupy only a small area of the screen, since their limited usage doesn't justify the space: the engine doesn't use mipmapping, so textures are always cached at full resolution.
Finally, the engine caches at most one texture, the highest-priority one, per frame. This prevents the game from pausing for long periods while it is busy copying texture data from ROM to RAM.
Unfortunately, no matter how optimised your allocation policy is, you will eventually run out of cache space and need to free up room for new entries. A number of algorithms, known as cache-replacement policies, are dedicated to this problem. Some of the more commonly known policies are LRU (least recently used), MRU (most recently used), LFU (least frequently used), MFU (most frequently used) and FIFO (first in, first out).
D32XR keeps track of the number of pixels covered by both cached and uncached textures and uses this to decide which cache entries, if any, to replace. The engine continuously adds new textures to the cache; each time one is added, an associated integer called the "life counter" is set to its maximum value.
When cache space is eventually exhausted and the engine needs to make room for a new high-priority texture, it looks through the existing entries for any of lower priority than the new texture. The life counters of lower-priority entries are decremented, and any that reach 0 are released from memory. The life counter lets some textures that are not currently visible remain in memory, since we might need them again moments later: in a 3D shooter, it is normal for the player to make 180-degree turns or travel back and forth between rooms. Furthermore, DOOM features animated textures, which are often used to indicate damaging floors. These are tied to the game's timer, and it would be suboptimal to cyclically replace animation frames based on the timer value. Overall, the approach used is a hybrid least frequently used (LFU) policy.
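The decrement-and-evict step can be sketched as follows (field names, the starting life value and the return convention are illustrative assumptions, not the engine's exact code):

```c
#include <stdint.h>

typedef struct {
    uint32_t pixels; /* priority: screen pixels covered this frame */
    int life;        /* decremented when outranked; 0 = free slot  */
} cacheentry_t;

/* Try to free a slot for a texture covering 'newpixels' pixels:
 * decrement the life counter of every lower-priority live entry and
 * release those that reach zero. Returns the index of a freed slot,
 * or -1 if nothing was released this pass. */
static int make_room(cacheentry_t *cache, int n, uint32_t newpixels)
{
    int freed = -1;
    for (int i = 0; i < n; i++) {
        if (cache[i].life > 0 && cache[i].pixels < newpixels) {
            if (--cache[i].life == 0)
                freed = i; /* entry released */
        }
    }
    return freed;
}
```

Because an outranked entry must be outranked several passes in a row before it is evicted, briefly off-screen textures and cycling animation frames survive momentary drops in priority.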
The game has a special debug mode, which turns off the texture cache and forces the game to read textures directly from cartridge ROM. This makes it easier to measure how the SDRAM cache affects performance. The mode displays a series of counters, each representing a render stage and the time that stage took to complete in milliseconds; the total is shown by the 'r' counter. The screenshots below were taken on an NTSC system, with the in-game screen resolution set to 160x180.
| Texture cache enabled | Texture cache disabled | Gain (ms) |
|---|---|---|
| (screenshot) | (screenshot) | 80-58=22 |
| (screenshot) | (screenshot) | 75-57=18 |
| (screenshot) | (screenshot) | 73-52=21 |
| (screenshot) | (screenshot) | 59-43=16 |
| (screenshot) | (screenshot) | 53-44=9 |
The screenshots above show how the texture cache speeds up rendering of a typical Doom scene by 9-22 milliseconds, or by roughly 17-29%. The simpler the scene, the smaller the gain, as there is always a constant rendering cost (e.g. updating every pixel on the screen each frame is guaranteed to take a certain number of CPU cycles).
Note that on an NTSC system with a refresh rate of 60Hz, a single display frame takes 16.666ms, so any improvement above that number buys at least one whole extra display frame and provides a substantial frame rate increase. In the case of D32XR, this helps bring the average frame rate from the low tens up to a playable high tens at the fullscreen resolution of 160x180.
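As a back-of-the-envelope illustration (not engine code), the display can only flip on a vblank, so the effective frame rate is 60 divided by the number of 16.666ms periods the render time rounds up to:

```c
/* One vblank on a 60 Hz NTSC display is 1/60 s = 50/3 ms. A frame
 * can only be presented on a vblank, so the render time effectively
 * rounds up to a whole number of vblank periods. */
static int fps_for_render_ms(int ms)
{
    int vblanks = (3 * ms + 49) / 50;   /* ceil(ms / (50/3)) */
    if (vblanks < 1)
        vblanks = 1;
    return 60 / vblanks;
}
```

By this arithmetic, the 80ms to 58ms improvement in the first table row moves a frame from five vblanks (12fps) down to four (15fps).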
Please refer to the source code for more information: https://github.com/viciious/d32xr/blob/master/r_cache.c
Written by Victor Luchits (Vic), 2023
Edited by Matt Bennion (MatteusBeus)
Screenshots by Joseph Fenton (Chilly Willy)