@yaaang's garden
Building Plasmic to bridge code and direct manipulation. Neural program synthesis and distributed database systems.

The many faces of cache maintenance

Cache invalidation is hard. But cache maintenance is harder.

This is about any computation $f(x)$ over some input state $x$. But now you have to update that state $x$, and thus your resulting output $f(x)$.

How do you make this more efficient than a naive recomputation? And how do you do so without losing your state? What’s easy: computing $f(x)$. What’s hard: computing $f'(x)$.

So by “cache,” I don’t just mean “soft state” that can be willy-nilly evicted and is pure optimization that you can live without. I’m generalizing to any computation $f(x)$ that must be maintained—any such derived state, including ones where losing that state would be Bad.
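To make the $f(x)$ vs. $f'(x)$ distinction concrete, here's a minimal sketch (names like `IncrementalSum` are illustrative, not from any library): maintaining a running sum under point updates, applying only the delta instead of recomputing from scratch.

```python
class IncrementalSum:
    """Maintains f(x) = sum(x) under point updates without full recomputation."""

    def __init__(self, xs):
        self.xs = list(xs)
        self.total = sum(self.xs)  # the one-time naive computation of f(x)

    def update(self, i, new_value):
        # f'(x): apply only the delta instead of re-running sum() over all of xs.
        self.total += new_value - self.xs[i]
        self.xs[i] = new_value

s = IncrementalSum([1, 2, 3])
s.update(1, 10)  # O(1) instead of O(n)
assert s.total == sum(s.xs) == 14
```

The sum is trivially incrementalizable because addition has an inverse; the whole difficulty of this problem space is that most interesting $f$s don't decompose this cleanly.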

Considerations, in descending order of importance:

• Ergonomics: can you make specifying the computation as “easy” as writing the original naive computation, without thinking about incrementality? Because you can always write a ton of new code and maintain separate code paths in order to achieve time/latency gains.
• Time: can you update the output substantially faster than a naive recomputation from scratch?
• Space: can you be efficient about how much memory or storage you would need? This is probably a distant consideration relative to the first two.

Here are some layers of complexity that can come into play, to keep in mind when thinking about the various instances of this problem:

• Push vs pull
• Selectively choosing what parts to invalidate, e.g. in frontend—sometimes you don’t want strict consistency (as it can be jarring), or for space efficiency
• Distributed systems: these generally make things harder, e.g. reasoning about time/ordering and tracking dependencies across nodes

Instances

This comes up everywhere, in lots of forms! Just a smattering:

Databases and data processing

This is a space very close to my heart! Databases have various forms of computed derivations of some source data.

• Materialized views, incremental view maintenance. This comes in many different shapes and sizes, including everything from differential dataflow (Naiad, Materialize) to Noria.
• Caching systems like Memcache, Redis, etc. Some of the more interesting research on making maintenance of these caches easier is TxCache, which leverages in-DB support to track valid time ranges on records and to trace which records went into producing which cached results.
• Dataflow and stream processing systems—including systems like MapReduce, Spark, Spark Streaming, Naiad, etc.—are about computing derivations of some source data. Doing better than batch processing to support incremental updates is what research such as Naiad, Timely Dataflow, and Differential Dataflow focuses on.
• The entire database can be seen as a materialized view of the write-ahead log at the heart of the database (plus any snapshots).
• OLAP databases and data warehouses to which operational updates are streamed can be seen as materializations of a log of transactions from the OLTP database. Change data capture (CDC), etc.
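A toy sketch of incremental view maintenance in the spirit of the systems above (not modeled on any particular one): a view of per-key row counts, maintained from a stream of signed deltas (+1 insert, −1 delete) rather than by re-scanning the base table.

```python
from collections import defaultdict

def apply_deltas(view, deltas):
    """view: {key: count}; deltas: iterable of (key, +1 | -1) change records."""
    for key, sign in deltas:
        view[key] += sign
        if view[key] == 0:
            del view[key]  # keep the view minimal: drop fully-retracted keys
    return view

view = defaultdict(int)
apply_deltas(view, [("a", +1), ("a", +1), ("b", +1)])  # three inserts
apply_deltas(view, [("a", -1)])                        # one delete
assert dict(view) == {"a": 1, "b": 1}
```

Differential dataflow generalizes this signed-delta idea to arbitrary operator graphs (joins, reductions, iteration), which is where it gets hard.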

Build systems

Compilers, transpilers, linkers, bundlers, minifiers, obfuscators, optimizers, imagers—build processes are large dependency graphs of all these different parts, ultimately taking some input source files and emitting output artifacts. Doing so efficiently, with the least amount of unnecessary work, is central to build system design and implementation. Most of these rely on some form of content-hashed invalidation.

• Make and friends—weakly rely on timestamp-based invalidation
• Docker’s layers
• Blaze/Bazel
• NX, TurboRepo, etc.
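Here's a minimal sketch of content-hash-based invalidation as build systems use it (the keying scheme is illustrative, not any tool's actual format): a step reruns only if its input bytes or the tool identity changed.

```python
import hashlib

def cache_key(source_bytes, tool_version):
    # Key the step by a hash of everything that could affect its output.
    h = hashlib.sha256()
    h.update(tool_version.encode())
    h.update(source_bytes)
    return h.hexdigest()

cache = {}

def build(source_bytes, tool_version="cc-1.0"):
    key = cache_key(source_bytes, tool_version)
    if key not in cache:
        cache[key] = b"compiled:" + source_bytes  # stand-in for real work
    return cache[key]

build(b"int main() {}")  # cold: runs the step
build(b"int main() {}")  # warm: cache hit, no work done
assert len(cache) == 1
```

Unlike Make's timestamps, content hashing is robust to clock skew and to touched-but-unchanged files, which is why Bazel-style systems use it for remote/shared caches.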

Computer architecture, operating systems

• (Literal) caches: cache coherence on shared-memory multiprocessor architectures is often implemented via message-passing protocols (e.g. MESI-family invalidation)
• Page management

Devops

• Infrastructure-as-code systems like Terraform maintain a materialization of a specification, but the set of updates/actions issued also depends on the state of the current live system. From there, we can compute a set of deltas.
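A toy sketch of that planning step (purely illustrative—Terraform's actual planner is far richer): diff the desired spec against the observed live state to produce a set of actions.

```python
def plan(desired, live):
    """desired, live: {resource_name: config}. Returns a sorted list of actions."""
    actions = []
    for name, cfg in desired.items():
        if name not in live:
            actions.append(("create", name))      # in spec, not yet live
        elif live[name] != cfg:
            actions.append(("update", name))      # live but drifted from spec
    for name in live:
        if name not in desired:
            actions.append(("delete", name))      # live but no longer in spec
    return sorted(actions)

acts = plan({"vm1": {"size": "large"}, "vm2": {"size": "small"}},
            {"vm1": {"size": "small"}, "vm3": {"size": "small"}})
assert acts == [("create", "vm2"), ("delete", "vm3"), ("update", "vm1")]
```

The interesting wrinkle versus ordinary cache maintenance is that the "cached" state (the live system) can drift on its own, so the planner must re-read it rather than trust its last materialization.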

Machine learning and optimization

This one is stretching the analogy a bit, but automatic differentiation and similar techniques to numerically compute gradients or backpropagation used by modern learning systems—these are literally trying to compute $f'(x)$ tractably. As you stream in more examples, or as you incrementally transfer-learn on new examples, you want to update your learned model $f$ (note this is about maintaining $f$ itself, not $f(x)$).

Interesting tidbit: four of TensorFlow's original paper authors were also authors of the Naiad paper.
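To make the "literally computing $f'(x)$" point concrete, here's a tiny forward-mode autodiff sketch via dual numbers (real frameworks use reverse mode for backprop; this is just the simplest self-contained illustration, with no framework APIs assumed):

```python
class Dual:
    """Carries a value and its derivative through arithmetic."""

    def __init__(self, val, d=0.0):
        self.val, self.d = val, d

    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.val + other.val, self.d + other.d)  # sum rule

    def __mul__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        # product rule: (uv)' = u'v + uv'
        return Dual(self.val * other.val,
                    self.d * other.val + self.val * other.d)

def f(x):
    return x * x + x * 3  # f(x) = x^2 + 3x, so f'(x) = 2x + 3

y = f(Dual(2.0, 1.0))     # seed the derivative of x w.r.t. itself as 1
assert y.val == 10.0 and y.d == 7.0  # f(2) = 10, f'(2) = 7
```

The derivative comes out of the same single pass that computes the value—no separate hand-written $f'$, which is the ergonomics win these systems are after.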

Frontend

Another place very close to my heart! Reactive view maintenance is what frontend development is about.

• Reactivity
• JavaFX, Knockout, MobX, etc.
• (In fact, you can run databases in the frontend for your state management)
• Often, nodes in a composite UI tree can each have their own internal state as well, updated independently of the source state from which the UI is regenerated
• Rx, FRP: yuck (in terms of ergonomics)
• React worries about concurrency and prioritization of updates, interleaved with side effects
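A minimal sketch of push-based reactivity in the spirit of Knockout/MobX (but not their actual APIs—`Cell` and `derived` are made-up names): a derived cell re-runs its compute function whenever a source cell it subscribed to changes.

```python
class Cell:
    """An observable value with push-based change notification."""

    def __init__(self, value):
        self._value = value
        self._subscribers = []

    def get(self):
        return self._value

    def set(self, value):
        self._value = value
        for fn in self._subscribers:  # push: eagerly notify dependents
            fn()

    def subscribe(self, fn):
        self._subscribers.append(fn)

def derived(compute, *sources):
    """A cell maintained as compute() over its source cells."""
    out = Cell(compute())
    for s in sources:
        s.subscribe(lambda: out.set(compute()))
    return out

a, b = Cell(1), Cell(2)
total = derived(lambda: a.get() + b.get(), a, b)
a.set(10)
assert total.get() == 12
```

Real systems layer a lot on top of this core: automatic dependency tracking (so you don't pass `sources` explicitly), glitch-free batching, and—in React's case—scheduling and prioritization of the recomputations themselves.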