Caching is used everywhere to speed up processing; it's used right from the CPU level to a layer in front of databases. Cache invalidation, i.e. deciding when to remove something from the cache, is a very interesting and complex problem to solve. I am going to talk about something simpler than that.
The issue I am describing in this post is very simple, yet it took almost 1.5 years to come to the surface. It was hidden because of how the other parts of the code were written, and it surfaced only as more users started using the framework and tried something other than the recommended approach. I will explain what it was later in the post.
Some background: I created a custom ML framework (sklearn based) within my current organization. It's an ML framework, so as you guessed, there are multiple data sources which are accessed frequently. I decided to add a layer of caching to speed things up, and initially I started with the built-in `lru_cache` from `functools`. Soon I realized that we needed some persistent caching, since we fetch the same data over and over (at least during development) and all of that data is static.
This layer of caching was needed because we use BigQuery: even though I have turned on caching on the BigQuery side, fetching data still takes some time and you still pay for egress. Caching in my case not only speeds things up but reduces cost as well.
I looked at Redis & Memcache, but I didn't have clarity on how those services would be deployed with our production setup. Also, I wanted to try out the idea first, so I decided to go with DiskCache, an amazing Python module. It uses SQLite to store the content on the local file system. I ran it with 32 processes and Pandas DataFrames up to 500MB, and it worked without any issues. Data is fetched and put into DiskCache (i.e. SQLite), and then I added a layer of `lru_cache` on top of that, so during execution the same data is served from memory.
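To give a rough idea of how the two layers fit together, here is a minimal sketch, not the framework's actual code; the cache directory and the `fetch_table` function are hypothetical stand-ins:

```python
from functools import lru_cache

from diskcache import Cache  # pip install diskcache

disk_cache = Cache("/tmp/ml_framework_cache")  # hypothetical cache directory


@lru_cache(maxsize=32)   # in-memory layer: repeated calls within one run are served from RAM
@disk_cache.memoize()    # persistent layer: results are pickled into SQLite and survive restarts
def fetch_table(query: str):
    ...  # the actual BigQuery fetch would go here
```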
Now that you have some idea about my caching setup, let's talk about the issue I had and how it came to the surface.
For 1.5 years, the framework was mostly used for training thousands of small models (of similar type, with changing target, estimator, etc.). Caching was used to store the pandas DataFrames and some other kinds of objects which were only accessed for reading. As more users started playing around with the tool to try out their ideas, a few of them complained about randomly getting wrong results. It was not easy to reproduce, as everything worked fine on consecutive runs. Then we found the problem.
As a coding standard in the framework, whenever we fetched data (i.e. a Pandas DataFrame) and did any kind of processing (transformations), we always saved the result in a new DataFrame. Some users started passing `inplace=True` out of habit, and this changed not only the current result but also the result stored in the cache. Let's dive deeper into why this happened.
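To make the failure mode concrete before switching to the simpler dict example below, here is a rough pandas sketch (the `load_features` function is made up for illustration) of how a single `inplace=True` call rewrites the value sitting in the cache:

```python
from functools import lru_cache

import pandas as pd


@lru_cache
def load_features() -> pd.DataFrame:
    # stands in for the expensive BigQuery fetch
    return pd.DataFrame({"target": [1, 2, 3], "feature": [0.1, 0.2, 0.3]})


df = load_features()
df.drop(columns=["feature"], inplace=True)  # mutates the cached DataFrame, not a copy

print(load_features().columns.tolist())  # ['target'] -- "feature" is gone for every later caller
```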
I am going to mimic the issue with a dict instead of a DataFrame.
I have a decorator that wraps the function/method with a cache. Something like below:
```python
from functools import lru_cache
import time
import typing as t


@lru_cache
def expensive_func(keys: t.Tuple[str, ...], vals: t.Tuple[t.Any, ...]) -> dict:
    time.sleep(3)
    return dict(zip(keys, vals))


def main():
    e1 = expensive_func(('a', 'b', 'c'), (1, 2, 3))
    print(e1)

    # Now we have e1 in the cache, so fetching it again should be instant
    e2 = expensive_func(('a', 'b', 'c'), (1, 2, 3))
    print(e2)

    # Since we store a dict (or DataFrame) object in the cache, what should
    # happen if I change the fetched value?
    e2['d'] = "amazing"  # should it change anything in the cached value?

    # If you answered no, then you are in for a surprise...
    print(e2)
    e3 = expensive_func(('a', 'b', 'c'), (1, 2, 3))
    print(e3)  # you will see that e3 also has the new key "d"


if __name__ == "__main__":
    main()
```
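Running this script prints something like the following; note that the key added through `e2` also shows up in `e3`, which was supposedly a fresh cached result:

```
{'a': 1, 'b': 2, 'c': 3}
{'a': 1, 'b': 2, 'c': 3}
{'a': 1, 'b': 2, 'c': 3, 'd': 'amazing'}
{'a': 1, 'b': 2, 'c': 3, 'd': 'amazing'}
```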
Why do we see this behaviour? Because `lru_cache` gives you a reference to the cached object, and when you modify it, you modify the actual value that reference points to.
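A quick way to convince yourself of this, reusing `expensive_func` from above, is to compare object identities:

```python
a = expensive_func(('a', 'b', 'c'), (1, 2, 3))
b = expensive_func(('a', 'b', 'c'), (1, 2, 3))
print(a is b)  # True -- both names point to the exact same cached object
```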
Solution
Since we know the issue is due to working with a reference to a cached value, and I have no control over how it will be used downstream, the easy (& simple) solution I implemented was to return a copy every time a result is returned. Since users are working with a copy of the data, they can change it or do anything with it. This results in data duplication, as you might end up with multiple copies of the same data, but that was not a problem for us.
I am going to mention a couple of points from some architecture books I am currently reading:
- No architecture is good or bad; it's about how expensive (time-wise) it is to make a change.
- Architecture is about making decisions with the least amount of bad outcomes (based on known-knowns and known-unknowns).
- It's always a compromise; if you think your solution has no compromises, then there is a chance that you have not understood the problem fully or thought about all scenarios (or they have not yet surfaced, the unknown-unknowns).
Now, getting back to the solution: all we need is a small decorator that wraps `lru_cache` (here `functools.cache`) and returns a `deepcopy` of the cached object every time it is accessed:
```python
from copy import deepcopy
from functools import cache, wraps


def custom_cache(func):
    cached_func = cache(func)  # evaluated only once!

    @wraps(func)
    def _wrapper(*args, **kwargs):
        return deepcopy(cached_func(*args, **kwargs))  # gets called every time

    return _wrapper
```
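As a quick sanity check, reusing the `expensive_func` example from earlier, wrapping it with `custom_cache` means a caller's mutation no longer leaks back into the cache:

```python
import time


@custom_cache
def expensive_func(keys, vals):
    time.sleep(3)
    return dict(zip(keys, vals))


e1 = expensive_func(('a', 'b', 'c'), (1, 2, 3))
e1['d'] = "amazing"  # only this caller's copy is modified

e2 = expensive_func(('a', 'b', 'c'), (1, 2, 3))
print('d' in e2)  # False -- the cached value stays untouched
```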
Lessons Learned
- Understood `lru_cache` in a bit more detail.
- Your recommended approach can prevent bugs.
- Users are not going to listen to / follow the same coding practices that developers do, so try to fix it in the implementation whenever you can.