Cache memory is a small (in size) and very fast (zero wait state) memory which sits between the CPU and main memory. Unlike normal memory, the bytes appearing within a cache do not have fixed addresses. Instead, cache memory can reassign the address of a data object. This allows the system to keep recently accessed values in the cache.
Using the Principle of Locality to improve performance while keeping the memory system affordable we can pose 4 questions about any level of memory hierarchy and we will answer those questions considering one level of memory hierarchy for e.g. cache in our case.
A number of hardware schemes have been developed for translating main memory addresses to cache memory addresses. The user does not need to know much about the address translation circuitry, which has the advantage, that cache memory enhancements can be introduced into a computer without a corresponding need for modifying application software.
Basically number of cache lines are very less than the number of main memory blocks. As a result an algorithm is needed for mapping main memory blocks into cache lines. Also a means is needed for determining which main memory block currently occupies a cache line.
The choice of cache mapping scheme affects cost and performance, and there is no single best method that is appropriate for all situations. There are three methods in block placement namely
In general a cache has two important parts; the cache data line and the cache tags. But in granular it can shown as below
Since a cache is typically smaller than an entire address space, there is a possibility that any particular requested data is not present in the cache. Therefore there must be some mechanism to determine whether any requested data is present in the cache or not. The tags fill this purpose.
Cache has an address tag on each block frame that gives the block address. Therefore tag entry of every cache block is checked to see if it matches the block address from the CPU. As a rule, all possible tags are searched in parallel because speed is critical.
There must be a way to know that a cache block does/doesn’t have valid information. The most common procedure is to add a valid bit to the tag to say whether or not this entry contains a valid address. If the bit is not set, there cannot be a match on this address. Accordingly an address, generated by CPU (or main memory address) is divided as shown below:
The first division is between the block address and the block offset. The block frame address can be further divided into the tag field and the index field. The block-offset field selects the desired data from the block, the index field selects the set, and the tag field is compared against it for a hit. Although the comparison could be made on more of the address than the tag, there is no need because of the following:
If the total cache size is kept the same, increasing associativity increases the number of blocks per set, thereby decreasing the size of the index and increasing the size of the tag.
When a miss occurs, the cache controller must select a block to be replaced with the desired data. A benefit of direct-mapped placement is that hardware decisions are simplified – in fact, so simple that there is no choice: Only one block frame is checked for a hit, and only that block can be replaced. With fully associative or set-associative placement, there are many blocks to choose from on a miss. There are three primary strategies employed for selecting which block to replace:
Random – To spread allocation uniformly, candidate blocks are randomly selected. Some systems generate pseudorandom block numbers to get reproducible behavior, which is particularly useful when debugging hardware.
Advantage : simple to implement in hardware
Disadvantage : ignores Principle of Locality
Least-recently used (LRU) – To reduce the chance of throwing out information that will be needed soon, accesses to blocks are recorded. Relying on the past to predict the future, the block replaced is the one that has been unused for the longest time. LRU relies on a corollary of locality: If recently used blocks are likely to be used again, then a good candidate for disposal is the least-recently used block.
Advantage : takes locality into account
Disadvantage : as the number of blocks to keep track of increases, LRU becomes more expensive (harder to implement, slower and often just approximated).
First In First Out (FIFO) – Because LRU can be complicated to calculate, this approximates LRU by determining the oldest block rather than the LRU.
Interaction Policies with Main Memory
Basically READ operations dominate processor cache accesses since many of the instruction accesses are READ operation’s and most instructions do not WRITE into memory. When the address of the block to be READ is available then the tag is read and if it is a HIT then READ from it.
In case of a miss the READ policies are:
Basically a Miss is comparatively slow because they require the data to be transferred from main memory to CPU which incurs a delay since main memory is much slower than cache memory, and also incurs the overhead for recording the new data in the cache before it is delivered to the processor. To take advantage of Locality of Reference, the CPU copies data into the cache whenever it accesses an address not present in the cache. Since it is likely the system will access that same location shortly, the system will save wait states by having that data in the cache. Thus cache memory handles the temporal aspects of memory access, but not the spatial aspects.
Accessing Caching memory locations won’t speed up the program execution if we constantly access consecutive memory locations (Spatial Locality of Reference). To solve this problem, most caching systems read several consecutive bytes from memory when a cache miss occurs. 80×86 CPUs, for example, read between 16 and 64 bytes at a shot (depending upon the CPU) upon a cache miss. If you read 16 bytes, why read them in blocks rather than as you need them? As it turns out, most memory chips available today have special modes which let you quickly access several consecutive memory locations on the chip. The cache exploits this capability to reduce the average number of wait states needed to access memory.
It is not the same with Cache WRITE operation. Modifying a block cannot begin until the tag is checked to see if the address is a hit. Also the processor specifies the size of the write, usually between 1 and 8 bytes; only that portion of the block can be changed. In contrast, reads can access more bytes than necessary without a problem.
The Cache WRITE policies on write hit often distinguish cache designs:
- Write Through – the modified data is written back to both the block in the cache memory and in the main memory.
1. READ miss never results in writes to main memory.
2. Easy to implement
3. Main Memory always has the most current copy of the data (consistent)
- WRITE operation is slower as we have to update both Main Memory and Cache Memory.
- Every write needs a main memory access as a result uses more memory bandwidth
- Write Back – the modified data is first written only to the block in the cache memory. The modified cache block is written to main memory only when it is replaced. In order to reduce the frequency of writing back blocks on replacement, a dirty bit (a status bit) is commonly used to indicate whether the block is dirty (modified while in the cache) or clean (not modified). If it is clean the block is not written on a miss.
- WRITE’s occur at the speed of the cache memory.
- Multiple WRITE’s within a block require only one WRITE to main memory as a result uses less memory bandwidth
- Harder to implement
- Main Memory is not always consistent with cache reads that result in replacement may cause writes of dirty blocks to main memory.
Incase of Cache Write MISS we have to options.
- Write Allocate – the memory block is first loaded into cache memory from main memory on a write miss, followed by the write-hit action.
- No Write Allocate – the block is directly modified in the main memory and not loaded into the cache memory.
Although either write-miss policy could be used with write through or write back, write-back caches generally use write allocate (hoping that subsequent writes to that block will be captured by the cache) and write-through caches often use no-write allocate (since subsequent writes to that block will still have to go to memory).
The data in main memory being cached may be changed by other entities, in which case the copy in the cache may become out-of-date or stale. Alternatively, when the CPU updates the data in the cache, copies of data in other caches will become stale. Communication protocols between the cache managers which keep the data consistent are known as cache coherence protocols.