If you work with Elixir on web services, you may have come across a library I work on named Cachex. As the name suggests, it's a library for caching expensive data in memory. It lives inside your application layers, using Erlang Term Storage (ETS) under the hood whilst offering a bunch of convenience layers on top.

Cachex is fairly popular; it's by far my most starred GitHub project, and I know of several companies (including my own) using it in production. It's been a while since the last major release of Cachex v3.0, which filled in several missing sections of the API and refactored a number of things to be more Elixir-y. Since then there have been a few patch releases, but nothing in the way of new features, as the API is fairly complete.

Centralized Routing

One requested feature, which was originally included in very early versions of Cachex (i.e. a few years ago), was the ability to "distribute" your cache instances across a cluster. This was originally implemented using Mnesia, and worked well enough at the time. The main issue was the overhead involved; because Mnesia was always used, even someone running a cache on a single node paid the cost of Mnesia. This was fairly substantial, especially when you look back and compare it to the current Cachex benchmarks. As this was undesirable in our use case, as well as many others, Mnesia (along with the ability to run a cache across nodes) was stripped out. The result was much higher cache throughput, and an internal codebase that was much easier to manage as it was now based on ETS rather than Mnesia.

Even so, the API layer was quite inflexible until recently; touching code in one place meant having to tweak it in several others. There was no centralization of the actual cache interaction, and that was something I decided needed to change in order to simplify interaction with the planned cache clusters in the future. The intent was for this to be as transparent as possible, so that no code changes would be required to switch between local and remote caches. This changed with the introduction of the new Cachex.Router module, the base implementation of centralized dispatch, which makes it much easier to hook in routing across a cluster.
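
To illustrate the idea only (this is a minimal sketch rather than Cachex's actual Cachex.Router code; the module name, the {module, function, args} shape and the hash-based node selection are all assumptions for the example), centralized dispatch means every action flows through a single function, which can then decide whether to execute locally or forward elsewhere:

defmodule MyCache.Router do
  # Every cache action is expressed here as a {module, function, args}
  # triple and funnels through dispatch/2, so cluster-aware behaviour only
  # has to be written in one place.
  def dispatch(nodes, {mod, fun, [key | _] = args}) do
    # the only cost for a local cache is this check on the node list
    case nodes -- [node()] do
      [] ->
        # local-only cache ([] or [ node() ]): execute directly on this node
        apply(mod, fun, args)

      _remote ->
        # clustered cache: hash the key (assumed to be the first argument)
        # to an owning node, and forward the call if it isn't this node
        target = Enum.at(nodes, :erlang.phash2(key, length(nodes)))

        if target == node() do
          apply(mod, fun, args)
        else
          :rpc.call(target, mod, fun, args)
        end
    end
  end
end

With a layout like this, a single-node cache only ever hits the first branch, which is where the low overhead mentioned below comes from.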

Reintroducing Distribution

The initial version of the Router was compatible with the existing mainline; it didn't change any behaviour when first introduced, but it opened the way for this commit which introduced routing across a cluster. You might note that both this commit and the previous commit are quite small; in fact, the actual delta between v3.0.3 (the previous release) and the v3.1 release is only around 2,000 LOC. This is because each action is centralized, so distributed behaviour only needs to be implemented in a single place; everything else works automagically.

What's even better is that there is practically zero overhead for those using a cache on a single node; not a single piece of the new code is hit in such cases beyond checking the length of a list in the configuration. As such there's no cost to those developers, and in fact there are actually performance gains in the v3.1.0 release due to the removal of some function-hopping overhead through the use of the Router. There are, however, a few things worth noting about the implementation:

  1. The distribution in Cachex works via a sharding algorithm; this means that there's only ever a single copy of a key on a single node in the cluster. If that node drops, or loses its memory, the copy of the key is gone. This can also be phrased as "Cachex does not provide replication".
  2. It's the responsibility of the developer to handle connections to other nodes; Cachex does not implement any reconnection logic, as it is not meant to be a distributed state manager. You should use something else to manage your nodes and handle downtime in your cluster (see the sketch after this list).
  3. If any node cannot be reached during startup, the cache will fail fast and not start up (since this is probably a bad situation).
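
As a rough illustration of the second point (this is plain OTP rather than anything Cachex provides; the module name and retry interval are invented for the example), you could monitor node events and retry connections yourself, or lean on a library such as libcluster to do it for you:

defmodule MyApp.NodeWatcher do
  use GenServer

  # Subscribes to node up/down events and retries dropped connections,
  # since Cachex itself does no connection management.
  def start_link(nodes), do: GenServer.start_link(__MODULE__, nodes)

  def init(nodes) do
    :ok = :net_kernel.monitor_nodes(true)
    Enum.each(nodes, &Node.connect/1)
    {:ok, nodes}
  end

  def handle_info({:nodedown, node}, nodes) do
    # naive retry after five seconds; the keys the dropped node owned are
    # gone regardless, as there is no replication
    Process.send_after(self(), {:reconnect, node}, 5_000)
    {:noreply, nodes}
  end

  def handle_info({:reconnect, node}, nodes) do
    Node.connect(node)
    {:noreply, nodes}
  end

  def handle_info({:nodeup, _node}, nodes), do: {:noreply, nodes}
end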

There are many other things worth reading in the documentation, so I'd advise taking a look there before using a distributed cache. Not all actions function in the same way, and certain actions are disabled entirely. The documentation is the best place to see the current state of what is supported and how it's supported.

A Quick Tour

Below is a simple example to demonstrate the transparency of a distributed cache. For this example, we'll just spin up a local cluster and distribute a cache instance across it. To enable distribution you provide the :nodes option at cache startup, including the names of all nodes you're going to connect to. If you provide an empty list, or one containing only [ node() ], your cache will run in local-only mode (which should be unsurprising). The simplest way to try out a local cluster is from the shell: start two iex sessions in separate terminals and use them to create a cache cluster:

# Terminal 1
$ iex --sname foo -S mix
iex(foo@localhost)1> Node.connect(:"bar@localhost")

# Terminal 2
$ iex --sname bar -S mix
iex(bar@localhost)1> Node.connect(:"foo@localhost")

# Both Terminals
iex(xxx@localhost)2> Cachex.start_link(:my_cache, [ nodes: Node.list ])

# Terminal 1
iex(foo@localhost)3> Cachex.put(:my_cache, 1, "one")
iex(foo@localhost)4> Cachex.put(:my_cache, 2, "two")

This will add two keys, 1 and 2, to our local cache cluster. Even though this looks like a regular cache interaction, each key will have been transparently routed to a potentially different node. These keys in particular are a good sample set, because they happen to map to different nodes. You can verify this by checking the local size of the cache on each node separately:

# Terminal 1
iex(foo@localhost)5> Cachex.size(:my_cache, [ local: true ])
{:ok, 1}

# Terminal 2
iex(bar@localhost)3> Cachex.size(:my_cache, [ local: true ])
{:ok, 1}

Providing [ local: true ] runs the size/2 command against only the current node. Without this flag, it would've returned the size of the cache across the whole cluster, like so:

iex(foo@localhost)6> Cachex.size(:my_cache)
{:ok, 2}

This pattern (of local: true) is supported on all cache-wide actions, so you can easily switch between local and distributed behaviour, as shown below. The default is always based on the node list you provided at startup, so a cache started with a list of nodes will default to cluster-wide commands.
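
As an illustration, continuing the hypothetical session above (and assuming the two keys stayed on separate nodes as shown earlier), clear/2 is another cache-wide action which accepts the flag; clearing only the local node leaves the key owned by the other node untouched:

# Terminal 1 - clear only the entries stored on this node
iex(foo@localhost)7> Cachex.clear(:my_cache, [ local: true ])
{:ok, 1}

# Terminal 1 - the entry owned by the other node is still in the cluster
iex(foo@localhost)8> Cachex.size(:my_cache)
{:ok, 1}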

Distributed Caches

For a long time I have recommended against including caches inside your application cluster, as it's typically bad to mix your application and data layers. There's a tendency in the Elixir world to assume that just because you can do something, you should. This is not the case, and you should really think about your use case to determine whether you can avoid caching across a cluster.

The easiest way to frame this is the following scenario: if you have ever depended on Redis or Memcached, did you install it on your application nodes? Probably not; you'd likely run a separate Redis instance and point your cluster at that server, as a separation of concerns. The same concept applies here; although you can distribute all the things when using Elixir/Erlang, you need to consider when and where you truly should.

Having said this, feel free to try out Cachex v3.1.0, and please let me know if you find any issues or have suggestions for future improvements. If anything is unclear, don't hesitate to reach out with questions.