
Understanding CPU caching and performance

An introduction to the concepts of CPU caching and performance.

The memory hierarchy

I'm sure you've figured it out already, but the warehouse in our analogy is the level 1 (or L1) cache. The L1 can be accessed very quickly by the CPU, so it's a good place to keep the code and data that the CPU is most likely to request next. (In a moment, we'll talk in more detail about how the L1 can "predict" what the CPU will probably want.) The L1's quick access time is a result of the fact that it's made of the fastest and most expensive type of static RAM, or SRAM. Since each SRAM memory cell is made up of four to six transistors (compared to the one-transistor-per-cell configuration of DRAM), its cost per bit is quite high. This high cost per bit means that we generally can't afford a very large L1 cache unless we really want to drive up the total cost of the system.

In most current CPUs, the L1 sits on the same piece of silicon as the rest of the processor. In terms of the warehouse analogy, this is a bit like having the warehouse on the same block as the workshop. This has the advantage of giving the CPU some very fast, very close storage, but the disadvantage is that now the main memory (our suburban warehouse) is just as far away from the L1 cache as it is from the processor. So if some data that the CPU needs isn't in the L1 (a situation called a cache miss), it's going to take quite a while to retrieve that data from memory. Furthermore, remember that as the processor gets faster, the main memory gets "further" away all the time. So while our warehouse may be on the same block as our workshop, the lumber yard has now moved not just out of town but out of state. For an ultra-high-clock-rate processor like the P4, being forced to wait for data to load from main memory in order to complete an operation is like our workshop having to wait a few days for lumber to ship in from out of state.
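
To see roughly what a cache miss costs on real hardware, here's a small C sketch of my own (not from the original article): it performs the same number of reads from a large buffer twice, once with a stride of 1 byte and once with a stride of 64 bytes, a typical cache line size, so that nearly every strided read pulls in a new line. On most machines the strided walk runs several times slower; exact timings are machine-dependent, and hardware prefetching can narrow the gap.

```c
/* A minimal sketch (not a rigorous benchmark) of the cache-miss penalty:
   same number of reads, cache-friendly vs. cache-hostile stride. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define BUF_SIZE (64 * 1024 * 1024)   /* 64 MB: far larger than any cache */
#define ACCESSES (64 * 1024 * 1024)   /* identical work in both walks     */

static long walk(const char *buf, size_t stride) {
    long sum = 0;
    size_t idx = 0;
    for (size_t i = 0; i < ACCESSES; i++) {
        sum += buf[idx];
        idx += stride;
        if (idx >= BUF_SIZE)
            idx -= BUF_SIZE;          /* wrap back to the start */
    }
    return sum;
}

int main(void) {
    char *buf = malloc(BUF_SIZE);
    if (!buf) return 1;
    for (size_t i = 0; i < BUF_SIZE; i++)
        buf[i] = (char)i;             /* touch every page up front */

    clock_t t0 = clock();
    long a = walk(buf, 1);    /* sequential: most reads hit in cache    */
    clock_t t1 = clock();
    long b = walk(buf, 64);   /* one read per cache line: mostly misses */
    clock_t t2 = clock();

    printf("stride 1: %.2f s   stride 64: %.2f s   (sums %ld %ld)\n",
           (double)(t1 - t0) / CLOCKS_PER_SEC,
           (double)(t2 - t1) / CLOCKS_PER_SEC, a, b);
    free(buf);
    return 0;
}
```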

Check out the table below, which shows common latency and size information for the various levels of the memory hierarchy. (These numbers are a bit old; current numbers are smaller and shrinking all the time.) Notice the large gap in access times between the L1 cache and main memory: for a 1 GHz CPU, a 50 ns wait means 50 wasted clock cycles. Ouch. To see the kind of effect such stalls have on a modern, hyperpipelined processor, check out Part I of my P4 vs. G4e comparison.
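
To spell out that arithmetic, here's a trivial sketch of my own, using only the figures quoted above: at 1 GHz, one clock cycle takes 1 ns, so a 50 ns memory access stalls the CPU for about 50 cycles.

```c
/* Back-of-the-envelope stall cost: at 1 GHz, one cycle = 1 ns, so
   cycles wasted = latency in ns * clock rate in GHz. */
#include <stdio.h>

int main(void) {
    double clock_ghz  = 1.0;   /* cycles per nanosecond              */
    double latency_ns = 50.0;  /* main-memory latency from the table */
    printf("wasted cycles: %.0f\n", latency_ns * clock_ghz);
    return 0;
}
```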

The solution to this dilemma is to add more cache. At first you might think we could get more cache by enlarging the L1, but as I said above, cost is a major factor limiting L1 cache size. In terms of our workshop analogy, rents are much higher in town than in the suburbs, so we can't afford much warehouse space before the rent eats into our bottom line to the point where the added cost of the space outweighs the benefit of increased worker productivity. We have to fine-tune the amount of warehouse space we rent, weighing all the costs and benefits so that we get the maximum output for the least cost.

A better solution than adding more L1 is to rent some cheaper, larger warehouse space right outside of town to act as a cache for the L1 cache. Hence processors like the P4 and G4e have a Level 2 (L2) cache that sits between the L1 cache and main memory. The L2 cache usually contains all of the data that's in the L1 plus some extra. The common way to describe this situation is to say that the L1 cache subsets the L2 cache, because the L1 contains a subset of the data in the L2. In the picture below, the red cells are the code and data for the process that the CPU is currently running, while the blue cells are unrelated to the currently active process. (This diagram will become clearer once we talk about locality of reference.)
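
To make that subset relationship concrete, here's a toy C model of my own (a simplification, not how real tag arrays work): on a miss, a line is installed in the L2 first and only then in the L1, so the L1's contents always remain a subset of the L2's.

```c
/* Toy model of the "subsetting" described above. The fixed arrays,
   linear search, and lack of eviction are my own simplifications;
   real caches use associative tag lookups. */
#include <stdbool.h>
#include <stdio.h>

#define MAX_LINES 16

typedef struct {
    long tags[MAX_LINES];  /* which cache lines are present */
    int  count;
    int  capacity;
} Cache;

static bool contains(const Cache *c, long tag) {
    for (int i = 0; i < c->count; i++)
        if (c->tags[i] == tag)
            return true;
    return false;
}

static void install(Cache *c, long tag) {
    if (!contains(c, tag) && c->count < c->capacity)
        c->tags[c->count++] = tag;  /* no eviction in this toy version */
}

/* Bring a line into the hierarchy while preserving inclusion. */
static void load_line(Cache *l1, Cache *l2, long tag) {
    install(l2, tag);  /* L2 gets the line first...            */
    install(l1, tag);  /* ...then L1, so L1 stays a subset of L2 */
}

int main(void) {
    Cache l1 = { .capacity = 2 };   /* small, fast level   */
    Cache l2 = { .capacity = 16 };  /* larger, slower level */

    for (long tag = 100; tag < 104; tag++)
        load_line(&l1, &l2, tag);

    /* Lines 100-101 land in both caches; 102-103 only fit in the L2. */
    for (long tag = 100; tag < 104; tag++)
        printf("line %ld: in L1? %d  in L2? %d\n",
               tag, (int)contains(&l1, tag), (int)contains(&l2, tag));
    return 0;
}
```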

This series of caches, starting with the page file on the hard disk (the lumber yard in our analogy) and going all the way up to the registers on the CPU (the workshop's workbenches), is called a cache hierarchy. As we go up the cache hierarchy towards the CPU, the caches get smaller, faster, and more expensive to implement; conversely, as we go down the cache hierarchy they get larger, cheaper, and much slower. The data contained in each level of the hierarchy is usually mirrored in the level below it, so that for a piece of data that's in the L1 cache there are usually copies of that same data in the L2, main memory, and page file. So each level in the hierarchy subsets the level below it. We'll talk later about how all of those copies are kept in sync.
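
Putting the levels together, here's a minimal sketch of how a load might walk down the hierarchy, paying roughly the latencies from the table below at each miss. The in_l1()/in_l2() checks are hypothetical placeholders for real tag lookups, written so that anything in the L1 is also in the L2.

```c
/* Minimal hierarchy walk: try the fastest level first and fall
   through on a miss. Latencies are the rough figures from the table. */
#include <stdbool.h>
#include <stdio.h>

static bool in_l1(long addr) { return addr % 4 == 0; }  /* placeholder */
static bool in_l2(long addr) { return addr % 2 == 0; }  /* placeholder */

static int load_latency_ns(long addr) {
    if (in_l1(addr)) return 3;   /* L1 hit                       */
    if (in_l2(addr)) return 10;  /* L1 miss, L2 hit              */
    return 50;                   /* miss everywhere: main memory */
}

int main(void) {
    for (long addr = 0; addr < 4; addr++)
        printf("addr %ld -> %d ns\n", addr, load_latency_ns(addr));
    return 0;
}
```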

Level                      Access Time                  Typical Size      Technology     Managed By
Registers                  1 - 3 ns                     < 1 KB            Custom CMOS    Compiler
Level 1 Cache (on-chip)    2 - 8 ns                     8 KB - 128 KB     SRAM           Hardware
Level 2 Cache (off-chip)   5 - 12 ns                    0.5 MB - 8 MB     SRAM           Hardware
Main Memory                10 - 60 ns                   64 MB - 1 GB      DRAM           Operating System
Hard Disk                  3,000,000 - 10,000,000 ns    20 - 100 GB       Magnetic       Operating System/User

As you can see from the above table, each level of the hierarchy is controlled by a different part of the system. Data is promoted up or demoted down the hierarchy based on a number of different criteria, but in the remainder of this article we'll concern ourselves only with the top levels of the hierarchy. Also, note again that these numbers are somewhat dated; I cobbled them together from a chart I found on the web (check the bibliography) and the Hennessy and Patterson book. More detailed numbers for P4- and G4e-based systems will be forthcoming in the next installment.
