DEV Community

dpsutton
dpsutton

Posted on

Exploring the core.cache api

Exploring the core.cache api

I was reading through the code of core.cache and wanted to figure out the "correct" way to use it. The protocol for it is a bit large so I wasn't sure what the best way to check/hit/miss was.

(defprotocol CacheProtocol
  "This is the protocol describing the basic cache capability."
  (lookup [cache e]
          [cache e not-found]
   "Retrieve the value associated with `e` if it exists, else `nil` in
   the 2-arg case.  Retrieve the value associated with `e` if it exists,
   else `not-found` in the 3-arg case.")
  (has?    [cache e]
   "Checks if the cache contains a value associated with `e`")
  (hit     [cache e]
   "Is meant to be called if the cache is determined to contain a value
   associated with `e`")
  (miss    [cache e ret]
   "Is meant to be called if the cache is determined to **not** contain a
   value associated with `e`")
  (evict  [cache e]
   "Removes an entry from the cache")
  (seed    [cache base]
   "Is used to signal that the cache should be created with a seed.
   The contract is that said cache should return an instance of its
   own type."))

The docs in core.cache mention the common usage pattern:

(if (cache/has? C :c)     ;; has? checks that the cache contains an item
  (cache/hit C :c)        ;; hit returns a cache with any relevant internal information updated
  (cache/miss C :c 42))   ;; miss returns a new cache with the new item and without evicted entries

;=> {:a 1, :b 2, :c 42}

Using the has?/hit/miss\ pattern ensures that the thresholding and eviction logic for each implementation works properly. It is built into the through\ and through-cache\ functions, as well as clojure.core.cache.wrapped/lookup-or-miss\. *Avoid this pattern at your own risk.*

This usage shows using a cache value but most usage in the wild will be based on some storage, usually an atom (although there are examples using mongo, sqlite, etc). It also shows hitting and missing but not the best way to do all of this and also get the value out.

I'm using the grep.app webapp to find usages in the wild. Search for usages of core.cache with https://grep.app/search?q=clojure.core.cache. The site grep.app is awesome and I want to contribute to the creator. You should use it to search for code if you don't already.

Examples

Stencil

Stencil is a mustache templating library using cache which is what metabase uses for its email templates in the search results from grep.app. It uses an api of get-in and assoc-in, notably never calling hit (miss is called by assoc under the hood). This means that any caches which depend on hit for stats won't work: notably LRUCache, LUCache.

(defn cache
  "Given a template name (string), variant key (string), template source
   (string), and optionally a parsed AST, and stores that entry in the
   template cache. Returns the parsed template."
  ([template-name template-variant template-src]
     (cache template-name template-variant template-src (parse template-src)))
  ([template-name template-variant template-src parsed-template]
     (swap! parsed-template-cache
            assoc-in [template-name template-variant]
            (template-cache-entry template-src
                                  parsed-template))
     parsed-template))

,,,

(defn cache-get
  "Given a template name, attempts to fetch the template with that
   name from the template cache. If it is not in the cache, nil will
   be returned. Single argument version gets the default variant."
  ([template-name]
     (cache-get template-name :default))
  ([template-name template-variant]
     (get-in @parsed-template-cache [template-name template-variant])))

A couple things stand out here. One, this uses get-in and assoc-in but the cache's don't seem to work that way. Second, by only using these two functions, hit and miss are not called so LU and LRU cache's can't compute frequency of use. This is especially bad since the default cache for stencil is an LRU cache:

;; The parsed template cache maps a template name to its parsed versions.
(def ^{:private true} parsed-template-cache
  (atom (try
          (require 'clojure.core.cache)
          ((resolve 'clojure.core.cache/lru-cache-factory) {})
          (catch ExceptionInInitializerError _
            (CoreCacheUnavailableStub_SeeReadme.))
          (catch FileNotFoundException _
            (CoreCacheUnavailableStub_SeeReadme.)))))

Some examples of how this caching strategy disobeys the eviction strategies:

(let [cache (atom (cache/lru-cache-factory {} :threshold 2))]
  (swap! cache assoc-in [:a :b] 1)
  (swap! cache assoc-in [:c :d] 2)
  ;; make [:a :b] the most recently used entry
  (get-in @cache [:a :b])
  (swap! cache assoc-in [:e :f] 3)
  ;; the [:a :b] value is not present
  (get-in @cache [:a :b])
  )
;; nil

(let [cache (atom (cache/lu-cache-factory {} :threshold 2))]
  (swap! cache assoc-in [:a :b] 1)
  (swap! cache assoc-in [:c :d] 2)
  (get-in @cache [:a :b])
  (get-in @cache [:c :d])
  (get-in @cache [:c :d])
  (get-in @cache [:c :d])
  ;; the cache now should only have :c and :e based on least used stats
  (swap! cache assoc-in [:e :f] 3)
  )

;; {:a {:b 1}, :e {:f 3}}

(let [cache (atom (cache/ttl-cache-factory {} :ttl 400))]
  (swap! cache assoc-in [:a :b] 1)
  (swap! cache assoc-in [:c :d] 2)
  (swap! cache assoc-in [:e :f] 3)
  (Thread/sleep 300)
  (swap! cache assoc-in [:a :new-key] :should-last-500)
  (Thread/sleep 300)
  ;; the [:a :b] key should be expired, the [:a :new-key] should be present
  ;; but the cache key is actualy just `:a`
  (prn [(get-in @cache [:c :d]) (get-in @cache [:a :b]) (get-in @cache [:a :new-key])])
  )

;; [nil 1 :should-last-500]

Rieman

Rieman uses a cache in the following way:

(def fun-cache
  "Speeds up the compilation of queries by caching map of ASTs to corresponding
  functions."
  (atom (cache/lru-cache-factory {} :threshold 64)))

(defn fun
  [ast]
  (if-let [fun (cache/lookup @fun-cache ast)]
    ; Cache hit
    (do (swap! fun-cache cache/hit ast)
        fun)

    ; Cache miss
    (let [fun (eval (list 'fn ['event] (clj-ast ast)))]
      (swap! fun-cache cache/miss ast fun)
      fun)))

This works well with the lru cache but actually NPEs on the lu cache. The problem is that the cache/hit function for the lu cache expects that the result is in the cache if hit is called and increments the hit count. But with the pattern above, there's no way to guarantee that. The following should npe spectacularly in a repl:

(let [lu-cache (atom (c/lu-cache-factory {} :threshold 4))
       thread-count 200
       loop-count 20000
       latch (java.util.concurrent.CountDownLatch. thread-count)]
   (doseq [i (range thread-count)]
     (.start (Thread. (fn []
                        (loop [c 0]
                          (if (< c loop-count)
                            (do
                              (if-let [v (c/lookup @lu-cache i)]
                                (swap! lu-cache c/hit i)
                                (swap! lu-cache c/miss i i))
                              (recur (inc c)))
                            (.countDown latch)))))))
   (.await latch))

Exception in thread "Thread-1664" Exception in thread "Thread-1682" Exception in thread "Thread-1690" java.lang.NullPointerException
        at clojure.lang.Numbers.ops(Numbers.java:1068)
        at clojure.lang.Numbers.inc(Numbers.java:137)
        at clojure.core$inc.invokeStatic(core.clj:927)
        at clojure.core$inc.invoke(core.clj:922)
        at clojure.lang.AFn.applyToHelper(AFn.java:154)
        at clojure.lang.AFn.applyTo(AFn.java:144)
        at clojure.core$apply.invokeStatic(core.clj:667)
        at clojure.core$update_in$up__6853.invoke(core.clj:6185)
        at clojure.core$update_in.invokeStatic(core.clj:6186)

...
Exception in thread "Thread-1851" Exception in thread "Thread-1792" java.lang.NullPointerException
Exception in thread "Thread-1739" java.lang.NullPointerException
Exception in thread "Thread-1863" java.lang.NullPointerException
Exception in thread "Thread-1840" java.lang.NullPointerException
Exception in thread "Thread-1850" java.lang.NullPointerException
Exception in thread "Thread-1825" java.lang.NullPointerException
Exception in thread "Thread-1859" java.lang.NullPointerException
Exception in thread "Thread-1740" java.lang.NullPointerException
Exception in thread "Thread-1743" java.lang.NullPointerException
Exception in thread "Thread-1741" java.lang.NullPointerException
Exception in thread "Thread-1849" Exception in thread "Thread-1742" java.lang.NullPointerException
Exception in thread "Thread-1839" Exception in thread "Thread-1729" java.lang.NullPointerException
Exception in thread "Thread-1862" java.lang.NullPointerException
Exception in thread "Thread-1845" java.lang.NullPointerException
...

This example also dereferences the cache twice, once to check if the value is present with lookup, and then again when doing the hit or miss functionality. This NPEs on the LU cache and seems less than optimal with the other caches.

clj-http

This repo includes an example of a caching middleware

(defn- cached-response
  ([client req]
   (let [cache-key (str (:server-name req) (:uri req) "?" (:query-string req))]
     (if (cache/has? @http-cache cache-key)
       (do
         (println "CACHE HIT")
         (reset! http-cache (cache/hit @http-cache cache-key)) ; update cache stats
         (cache/lookup @http-cache cache-key)) ; return cached value
         ; do not invoke further middleware
       (do
         (println "CACHE MISS")
         (let [resp (update (client req) :body slurp-bytes)] ; middleware chain invoked
           (when (and (http/success? resp) (= (:request-method req) :get))
             (reset! http-cache (cache/miss @http-cache cache-key resp)) ; update cache value
             resp)))))))

;; in essence
(if (cache/has? @cache k)
  (do (swap! cache cache/hit k)
      (cache/lookup @cache k))
  (do (let [value (compute k)]
        (swap! cache cache/miss k value)
        value)))

This has a good aspect on the miss branch and a bad example on the hit branch. I'm omitting the reset! usage as that's not great usage of the Atom api. The good aspect in the miss branch is that the value is computed, put into the cache, and then returned without looking the value up in the cache container. It is not guaranteed to be there. This error avoided in the miss branch is exemplified in the hit branch. The fact that one derefed cache value cache/has?\ an entry does not guarantee that lookup will succeed. This is seemingly true of all of the cache's besides the basic cache. And the LU bug should NPE here when hit is called and the cache doesn't have the value.

(doseq [[cache-type cache'] [#_[:lu (c/lu-cache-factory {} :threshold 4)]
                             [:lru (c/lru-cache-factory {} :threshold 4)]
                             [:ttl (c/ttl-cache-factory {} :ttl 20)]]]
  (let [cache (atom cache')
        thread-count 20
        loop-count 20000
        latch (java.util.concurrent.CountDownLatch. thread-count)
        has-but-missing (atom 0)]
    (doseq [i (range thread-count)]
      (.start (Thread. (fn []
                         (loop [c 0]
                           (if (< c loop-count)
                             (do (if (c/has? @cache i)
                                   (do (swap! cache c/hit i)
                                       (if-let [v (c/lookup @cache i)]
                                         ::good
                                         (swap! has-but-missing inc)))
                                   (swap! cache c/miss i i))
                                 (recur (inc c)))
                             (.countDown latch)))))))
    (.await latch)
    (println cache-type " has? but missing" @has-but-missing)))
:lru  has? but missing 74
:ttl  has? but missing 89
nil

Each of these is an example where the cache reported that it had a value and then subsequent access came back empty. This is a classic race condition. Using this pattern allows for the scenario where no value to computed and no value is retrieved from the cache.

Other examples of this:

A good example

https://github.com/datacrypt-project/hitchhiker-tree/blob/master/src/hitchhiker/redis.clj

(let [cache (-> {}
                (cache/lru-cache-factory :threshold 10000)
                atom)]
  (defn totally-fetch
    [redis-key]
    (let [run (delay
                (loop [i 0]
                  (if (= i 1000)
                    (do (println "total fail") (throw (ex-info "total fail" {:key redis-key})))
                    (let [x (wcar {} (car/get redis-key))]
                      (if x
                        x
                        (do (Thread/sleep 25) (recur (inc i))))))))
          cs (swap! cache (fn [c]
                            (if (cache/has? c redis-key)
                              (cache/hit c redis-key)
                              (cache/miss c redis-key run))))
          val (cache/lookup cs redis-key)]
      (if val (<?? val) @run))))

The important part is separating when the cache is a value and when the cache is a mutable location. The hit and has?/miss is inside of the the swap! function. The cache returned from the swap is the one used to get the value. There's a nice function that encapsulates this:

(defn through-cache
  "The basic hit/miss logic for the cache system.  Like through but always has
  the cache argument in the first position for easier use with swap! etc."
  ([cache item] (through-cache cache item default-wrapper-fn identity))
  ([cache item value-fn] (through-cache cache item default-wrapper-fn value-fn))
  ([cache item wrap-fn value-fn]
   (if (clojure.core.cache/has? cache item)
     (clojure.core.cache/hit cache item)
     (clojure.core.cache/miss cache item (wrap-fn #(value-fn %) item)))))

I think this is the proper api of core.cache. Use through-cache to get a new cache value and use that exclusively. The resulting cache mutable container is not guaranteed to have the value and cannot be relied on for lookup. This is of course standard interaction with an atom where multiple dereferences are bad. And of course the lookup-or-miss function in clojure.core.cache.wrapped is probably what should always be used:

(defn lookup-or-miss
  "Retrieve the value associated with `e` if it exists, else compute the
  value (using value-fn, and optionally wrap-fn), update the cache for `e`
  and then perform the lookup again.

  value-fn (and wrap-fn) will only be called (at most) once even in the
  case of retries, so there is no risk of cache stampede.

  Since lookup can cause invalidation in some caches (such as TTL), we
  trap that case and retry (a maximum of ten times)."
  ([cache-atom e value-fn]
   (lookup-or-miss cache-atom e default-wrapper-fn value-fn))
  ([cache-atom e wrap-fn value-fn]
   (let [d-new-value (delay (wrap-fn value-fn e))]
     (loop [n 0
            v (c/lookup (swap! cache-atom
                               c/through-cache
                               e
                               default-wrapper-fn
                               (fn [_] @d-new-value))
                        e
                        ::expired)]
       (when (< n 10)
         (if (= ::expired v)
           (recur (inc n)
                  (c/lookup (swap! cache-atom
                                   c/through-cache
                                   e
                                   default-wrapper-fn
                                   (fn [_] @d-new-value))
                            e
                            ::expired))
           v))))))

It seems to me that lookup-or-miss is by far the best way to interact with an atom backed cache and is basically the only function that should be called. The CacheProtocol is useful but invites the caller to do too much outside of a swap function.

clojure.core.cache.wrapped

The namespace clojure.core.cache.wrapped provides some helper functions for when operating on an atom stored cache. It provides dereferencing functions for has?, hit, miss, evict, and seed. I think these functions are footguns and should almost never be used as they invite checking the cache and operating on a cache which doesn't have the value.

Operating on just a cache value (not an atom)

The protocol even invites using has? and friends on a cache value rather than an atom, but even this is isn't safe. In the above examples, most of the bugs came from other threads hammering a cache in between a has? check and then lookup. However, even a value isn't safe in all cases. Most of the caches either have a value or don't so they are safe. But the TTL cache has a temporal aspect that can cause a has? check to return true and then the subsequent lookup to fail.

(doseq [[cache-type init] [[:lu (c/lu-cache-factory {} :threshold 4)]
                           [:ttl (c/ttl-cache-factory {} :ttl 50)]]]
  (let [cache-atom (atom init)
        thread-count 200
        loop-count 20000
        latch (java.util.concurrent.CountDownLatch. thread-count)
        has-but-no-value (atom 0)]
    (doseq [i (range thread-count)]
      (.start (Thread. (fn []
                         (loop [c 0]
                           (if (< c loop-count)
                             (do
                               (let [cache-value @cache-atom]
                                 (if (c/has? cache-value i)
                                   ;; this is just a value, so if it has, we can lookup?
                                   (if-let [v (c/lookup cache-value i)]
                                     ::good
                                     (swap! has-but-no-value inc))
                                   (swap! cache-atom c/miss i i)))
                               (recur (inc c)))
                             (.countDown latch)))))))
    (.await latch)
    (println "has? but no value: " @has-but-no-value)))
has? but no value:  0
has? but no value:  224
nil

The lu cache always had a value when has? returns true. But the ttl expired inbetween the has? and lookup checks on the value 224 times. Therefore, just using the CacheProtocol on an arbitrary cache isn't safe.

So what if we adjust this and use the through-cache function which uses the helpful logic that the resulting cache has the value?

(doseq [[cache-type init] [[:lu (c/lu-cache-factory {} :threshold 4)]
                           [:ttl (c/ttl-cache-factory {} :ttl 50)]]]
  (let [cache-atom (atom init)
        thread-count 200
        loop-count 20000
        latch (java.util.concurrent.CountDownLatch. thread-count)
        has-but-no-value (atom 0)]
    (doseq [i (range thread-count)]
      (.start (Thread. (fn []
                         (loop [c 0]
                           (if (< c loop-count)
                             (do
                               (let [cache-value (swap! cache-atom c/through-cache i (constantly i))]
                                 (if-let [v (c/lookup cache-value i)]
                                   ::good
                                   (swap! has-but-no-value inc)))
                               (recur (inc c)))
                             (.countDown latch)))))))
    (.await latch)
    (println "has? but no value: " @has-but-no-value)))
has? but no value:  0
has? but no value:  50
nil

When using the lu, we of course always have a value. But the ttl cache is still susceptible to returning a cache that doesn't have the value. It would seem that the only good way to interact with the cache is with clojure.core.cache.wrapped/lookup-or-miss.

As an addendum, when running the above comparisons and tracking the computations involved, we get the following numbers:

:lu
has? but no value: 0
computations: 19,566,746
:ttl
has? but no value: 45
computations: 565

and when run with the lookup-or-miss:

:lu
has? but no value: 0
computations: 7549
:ttl
has? but no value: 0
computations: 515

When we used the through-cache function we computed our value 19 million times, but had we not cached we would only have computed it (* 20 20000) = 400,000 times!. The contention to save the cached value dwarfed the number of times we even needed to check the value. The lookup-or-miss has a helpful strategy to just give up after 10 times and return the value. I also saw this same story when using core.memoize which uses core.cache with some major modifications under the hood.

The moral of the story

A couple points:

  • a core.cache is either a value or a mutable place which can give you a value. Use clojure.core.cache.wrapped/lookup-or-miss on an atom and be done with it. This should be the only function you use in most cases.
  • don't deref a cache (or in general, an atom) more than once. If you are, figure out how to use lookup-or-miss or through-cache instead.
  • the api kinda invites this. I think the functions in clojure.core.cache.wrapped for has?, hit, miss, and evict are foot guns and should never be used. The has?, hit, and miss, functions in clojure.core.cache should only be used inside of a single swap! function. However the protocol invites usage like this and the bugs are subtle.

Most of the bugs explored above arise only under high contention on the cache. Most caches probably aren't under this load and the bugs won't manifest. However, as things scale or caching code goes untouched for a while, this source of bugs could be quite nasty to track down and might only occur under production loads.

The above bugs in stencil, rieman, and the clj-http example should be fixed and I'm going to suubmit PRs for them soon. I want to raise an issue about the LU cache NPE but I'm not sure if its a bug or not. The docstring states that it should be called when the cache is known to have the key, but as discussed this is simply not possible. Most caches handle a missing key just find but the LU cache expects the key there and it to have a usage count.

I don't write often so I hope that the tone of this is exploration and thinking, not pointing at bugs. I wrote this on a Sunday when I was reading the core.cache library. Special thanks to Sean Corfield who writes so much amazing software and even more docs.

Top comments (1)

Collapse
 
lukaszblonski profile image
lukasz-blonski

Hey! Thanks for the insightful article! There is an interesting detail about how lookup-or-miss works: to actually get the value, the function performs a lookup after a miss or a hit. In there hides a dangerous assumption that a cache will always accept a new entry through the miss functionality. While that will probably is fine for the average use case, it fails for the edge cases where you may want a cache either of size 0 or a near-zero TTLs or maybe a custom eviction policy that won't always make space for the new entry. In the implementation of lookup-or-miss from the cache.wrapped namespace, if the the cache fails to accept a new entry, the return value of value-fn will never be returned to the caller, because a lookup will return nil.