Cache Me If You Can — Updated

A more in-depth look at caching and when to use it, with examples written in Python

This article is an updated version of my earlier article on caching.

You can find the code for this article here.

Caching is a wonderful tool in a modern world where memory is cheap and computation is less so. It is found at virtually every level of the computer: the hardware L1–L3 caches and the TLB of each CPU, the page cache at the operating system level, and the very browser you are viewing this on. Caching is clearly useful, but understanding when we can or can't use it is just as important. A cache with a hit rate below 50% isn't doing much for us, and it still incurs a real memory overhead. At the CPU level, and depending on the degree of multi-threading, even hit rates below 80% can be inefficient, and the resulting thrashing can cause major slowdowns at the higher levels. Thrashing is the constant removal of useful data from the cache, only for it to be re-added a short time later.

I'm going to go over some common caching algorithms, and just like the first iteration of this article, I have provided a general-purpose decorator in Python that lets you dictate the size and style of cache you wish to use, as well as a global cache manager that can be used to monitor the caches you attach to functions. I will first go over each of the implementations.
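To give a feel for what such a decorator looks like, here is a stripped-down sketch. The names and interface are illustrative only and do not necessarily match the repository; it omits the cache-style option and the global manager and keeps just the core memoization loop around a fixed-size LRU.

```python
from collections import OrderedDict
from functools import wraps

def cached(size=128):
    """Illustrative sketch of a fixed-size memoizing decorator (LRU-style).
    The real decorator also lets you pick the eviction style and registers
    itself with a global manager; this only shows the core idea."""
    def decorator(func):
        cache = OrderedDict()

        @wraps(func)
        def wrapper(*args):
            if args in cache:              # hit: refresh recency and return
                cache.move_to_end(args)
                return cache[args]
            result = func(*args)           # miss: compute and stash
            cache[args] = result
            if len(cache) > size:          # full: evict the least recently used
                cache.popitem(last=False)
            return result

        return wrapper
    return decorator

@cached(size=2)
def slow_square(x):
    return x * x
```

The important part is that the wrapped function never knows the cache exists; all the sizing and eviction policy lives behind the decorator.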

All of these caches are fixed-size caches, and unlike my previous decorator, each algorithm comes with a different eviction policy. Put more concisely: when our cache is full, which entry should we get rid of?

FIFO is the simplest of the caching algorithms: it's just a queue! As entries enter the cache they are pushed onto the back of the queue, and once the cache needs more space, the oldest entry at the front is popped off. Its simplicity is its strength, but its weakness is that it cannot adapt. It suffers whenever any sort of randomness is involved.

That being said, all caching algorithms suffer when access is completely random. If you have some massive amount of data that you can't possibly cache in full, and there is no pattern in how you access it, then which entries to keep is just a guess! At that point you may as well not use a cache at all. Luckily, much of the data we consume has locality to it: if a program accesses an entry once, it will probably want it again soon, and it is also more likely to access pieces of data close by (either in memory or in some logical sense of closeness). Caches, and many other algorithms, take advantage of this data locality.

Back to FIFO: its weakness is that no entry is treated as more important than any other. Since the data we choose to cache has locality, this is obviously not true! We want to estimate the importance of entries, and FIFO simply does not do that, so when a random element comes in it will most likely eject an entry that is more important. That random entry then sits in the queue until it reaches the head and is ejected, reducing our effective capacity by 1/N (where N is the size of the cache) for every random element present.
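As a concrete reference, a minimal in-memory FIFO cache might look like the following sketch (not the exact code from the repo), using an OrderedDict as the queue:

```python
from collections import OrderedDict

class FIFOCache:
    """Minimal FIFO cache sketch: insertion order decides eviction,
    and a hit does not change an entry's position in the queue."""

    def __init__(self, size):
        self.size = size
        self.entries = OrderedDict()   # insertion-ordered key -> value

    def get(self, key):
        return self.entries.get(key)   # hits never reorder anything

    def put(self, key, value):
        if key not in self.entries and len(self.entries) >= self.size:
            self.entries.popitem(last=False)   # evict the oldest entry
        self.entries[key] = value
```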

Least recently used (LRU) is a very common cache style found in many areas of the systems stack. It works in the same style as FIFO, where some linked list forms a queue, but on a hit you take the element out and put it at the back of the queue. In this way we preserve its life in the queue, and the elements that have not been used in a while are naturally evicted! It is simple and easy to implement, maintains locality of data, and has a straightforward metric for keeping important entries: an all-around great algorithm to use. There is a cost at the lowest levels, though. We want actions on a cache to take O(1) time, and the cost here lands on what is hopefully our most common case, the hit. During a hit we must modify the data structure by shuffling a bunch of pointers around. If your cache is being hit millions of times a second, this cost adds up and can hurt performance significantly. In the highest layers of an application the cost matters far less, as in my Python example, where the language itself already adds plenty of overhead!
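A minimal sketch of the same idea in Python, with an OrderedDict standing in for the linked list (again, not the exact code from the repo):

```python
from collections import OrderedDict

class LRUCache:
    """Minimal LRU cache sketch: hits move an entry to the back of the
    queue, so the front always holds the least recently used entry."""

    def __init__(self, size):
        self.size = size
        self.entries = OrderedDict()

    def get(self, key):
        if key not in self.entries:
            return None
        self.entries.move_to_end(key)   # the hit cost: reorder on every hit
        return self.entries[key]

    def put(self, key, value):
        if key in self.entries:
            self.entries.move_to_end(key)
        elif len(self.entries) >= self.size:
            self.entries.popitem(last=False)   # evict the least recently used
        self.entries[key] = value
```

Note that even the read path mutates the structure; that is exactly the hit cost discussed above.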

Clock is an algorithm that approximates LRU! It was introduced to perform nearly as well as LRU without the hit cost. It works with a circular list and a variable known as the clock needle, which points at some element in the list. When an entry is stashed in the cache, it is placed in the list with a reference bit set to 1. When an eviction is needed, we look at the entry the clock needle is pointing at and check its bit: if the bit is 1, we set it to 0 and move the needle one element over; if it is 0, we evict that entry. This is known as giving the entry a second chance: the algorithm sees that it is not useful right now, but is willing to give it another shot. During a hit, we simply set the entry's bit back to 1.

That's it! Just like LRU, this algorithm is quite simple to implement, and it approximates LRU well in practice. Its greatest strength is that approximation without the LRU hit cost: during a hit, all we have to do is flip a bit, which is much cheaper than restructuring an entire data structure.
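A minimal sketch of Clock in Python might look like this (illustrative, not the repo's implementation): a circular buffer of slots, a needle, and one reference bit per entry.

```python
class ClockCache:
    """Minimal Clock cache sketch: a fixed-size circular buffer, a needle,
    and one reference bit per entry instead of a reordered queue."""

    def __init__(self, size):
        self.size = size
        self.slots = [None] * size      # each slot holds [key, value, ref_bit]
        self.index = {}                 # key -> slot position
        self.needle = 0

    def get(self, key):
        pos = self.index.get(key)
        if pos is None:
            return None
        self.slots[pos][2] = 1          # the hit cost: just set the bit
        return self.slots[pos][1]

    def put(self, key, value):
        if key in self.index:           # update in place on a re-insert
            pos = self.index[key]
            self.slots[pos][1] = value
            self.slots[pos][2] = 1
            return
        # Advance the needle until we find a slot with its bit at 0,
        # clearing bits (the "second chance") as we pass them.
        while self.slots[self.needle] is not None and self.slots[self.needle][2] == 1:
            self.slots[self.needle][2] = 0
            self.needle = (self.needle + 1) % self.size
        old = self.slots[self.needle]
        if old is not None:
            del self.index[old[0]]      # evict the entry whose bit reached 0
        self.slots[self.needle] = [key, value, 1]
        self.index[key] = self.needle
        self.needle = (self.needle + 1) % self.size
```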

The Adaptive Replacement Cache (ARC) is the most complicated cache I will be covering. It is composed of two data caches (T1, T2) and two ghost-entry caches (B1, B2). T1 is a modified FIFO cache that holds recently used entries, while T2 is an LRU cache that holds the most frequently used entries. B1 and B2 hold keys but no data; they allow the cache to adapt the sizes of T1 and T2. During a fetch, all four caches are searched for the key, and where we hit or miss dictates how the cache adapts. Adaptation works through a sliding-window concept: the size of the window is the actual size of your cache, and the window slides based on hits within the B1 and B2 ghost caches.

Adaptive Replacement Cache

T1 Hit : The element is moved to the T2 cache. T2 is LRU, so if it is full it evicts an element and places that element's key in the B2 ghost cache.

T2 Hit : A standard LRU hit; the element is moved to the back of the queue to preserve it within the cache.

B1 Hit : This signals that we are not currently allocating enough space to the T1 cache, so we should move our window to the left! It still counts as a miss, since B1 only holds ghost entries; B1 acts as a way to check whether we made the right choice in evicting an entry. Moving the window just means increasing the maximum size of the T1 cache and lowering the maximum size of the T2 cache, followed by an eviction from T2 if needed.

B2 Hit : The same logic as a B1 hit, but it tells us we should be allocating more space to the T2 cache, so we move the window to the right.
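Tying the four cases above together, here is a heavily simplified sketch of the idea in Python. It is not the full ARC algorithm from the original paper, nor the code from my repo: the adaptation step here just nudges a single target size up or down by one, and the ghost lists are bounded crudely.

```python
from collections import OrderedDict

class SimpleARC:
    """Heavily simplified sketch of the ARC idea: T1 for recency, T2 for
    frequency, B1/B2 as ghost lists that slide the window (here, the
    target size of T1) left or right on ghost hits."""

    def __init__(self, size):
        self.size = size                # total real entries across T1 + T2
        self.t1_target = size // 2      # current share of the window for T1
        self.t1, self.t2 = OrderedDict(), OrderedDict()
        self.b1, self.b2 = OrderedDict(), OrderedDict()   # keys only, no data

    def _evict(self):
        # Shrink whichever side is over its share, remembering the key as a
        # ghost so a future hit on it can slide the window back.
        if self.t1 and (len(self.t1) > self.t1_target or not self.t2):
            key, _ = self.t1.popitem(last=False)
            self.b1[key] = None
        else:
            key, _ = self.t2.popitem(last=False)
            self.b2[key] = None
        while len(self.b1) > self.size:     # keep ghost lists bounded
            self.b1.popitem(last=False)
        while len(self.b2) > self.size:
            self.b2.popitem(last=False)

    def get(self, key):
        if key in self.t1:              # T1 hit: promote to the frequent side
            value = self.t1.pop(key)
            self.t2[key] = value
            return value
        if key in self.t2:              # T2 hit: standard LRU refresh
            self.t2.move_to_end(key)
            return self.t2[key]
        if key in self.b1:              # ghost hit: give T1 more of the window
            self.b1.pop(key)
            self.t1_target = min(self.size, self.t1_target + 1)
        elif key in self.b2:            # ghost hit: give T2 more of the window
            self.b2.pop(key)
            self.t1_target = max(0, self.t1_target - 1)
        return None                     # both ghost cases still count as misses

    def put(self, key, value):
        if key in self.t2:              # update an already-frequent entry
            self.t2[key] = value
            self.t2.move_to_end(key)
            return
        self.t1[key] = value            # new keys enter through T1
        if len(self.t1) + len(self.t2) > self.size:
            self._evict()
```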

Again, you can find a working demo and implementation of all the caches here. These are all in-memory caches; I may implement them as file caches later, but that is an easy extension. Try to think about the costs associated with using a file cache versus an in-memory one (the cost of going to disk versus staying in memory).

Conclusion

There are many styles of caches, and many more than the ones I mentioned here! Each offers some benefit or weakness relative to the others. I hope this article gave you a foundation of knowledge around the proper use of in-memory caches. Remember that caches are very powerful tools that require a good amount of thought about when and where to use them, as well as what style you require. Make sure to check the hit rates of your caches!

Performance Notes

There are lots of interesting issues once you start multi-threading these caches (if you need a multi-threaded cache, I would implement it in C or C++). Looking at LRU, for example, we require a highly concurrent queue. The problem with structures like these is that they enforce order, and order-enforcing data structures are at odds with concurrent data structures: enforcing order requires synchronization across all the threads, and that synchronization creates a bottleneck at the head of your queue. You end up having to lock the data structure to make changes (which may be fine for your needs), leaving you with worse-than-single-threaded performance across all threads. If you want a highly performant concurrent data structure, you need to relax your standards for ordering.
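To make the "just lock it" option concrete, here is a minimal sketch of a coarse-grained thread-safe LRU in Python. It is illustrative only; in CPython the GIL already serializes much of this anyway, which is part of why a genuinely concurrent cache belongs in C or C++.

```python
import threading
from collections import OrderedDict

class LockedLRUCache:
    """Coarse-grained thread-safe LRU sketch: one lock guards the whole
    structure, so every hit and miss serializes behind it. Simple and
    correct, but it is exactly the bottleneck described above under
    heavy contention."""

    def __init__(self, size):
        self.size = size
        self.entries = OrderedDict()
        self.lock = threading.Lock()

    def get(self, key):
        with self.lock:                      # even readers wait here,
            if key not in self.entries:      # because a hit mutates the order
                return None
            self.entries.move_to_end(key)
            return self.entries[key]

    def put(self, key, value):
        with self.lock:
            if key in self.entries:
                self.entries.move_to_end(key)
            elif len(self.entries) >= self.size:
                self.entries.popitem(last=False)
            self.entries[key] = value
```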

It's hard to talk about performance without addressing the choice of Python. I believe Python is a fantastic tool for laying out and blueprinting algorithms, but if you require a highly optimised cache, you may want to look into ones implemented in C.
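For reference, CPython's standard library already ships a fixed-size LRU memoizer whose fast path is implemented in C in recent versions, and it is usually the first thing worth trying before rolling your own:

```python
from functools import lru_cache

@lru_cache(maxsize=256)        # fixed-size LRU cache maintained by the stdlib
def expensive(x):
    return x ** x

expensive(10)                  # miss: computed and cached
expensive(10)                  # hit: served from the cache
print(expensive.cache_info())  # e.g. CacheInfo(hits=1, misses=1, maxsize=256, currsize=1)
```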

Anyway, I hope you found this article informative, and if you find any bugs, have any questions, or have suggestions on topics you would like me to cover, let me know!

Cheers!

My goal is to share my life, experiences, knowledge, and passion with anyone who cares to read. Currently a PhD student at the University of Waterloo.