Sharding is one of the best-known and most effective ways to scale out a persistence layer. It lets you get around the single-master write bottleneck that appears in replication schemes, and it can improve availability and parallelism. One problem that often arises, however, is rebalancing the data. Presharding adds a metadata layer atop the traditional sharding approach and offers a number of benefits in practice - it’s true, I’ve seen it happen! :-)

An example

Say you have a system that writes user data to persistent storage. This data changes frequently, so to increase write performance you’ve done the right thing and opted for a sharding solution. You have N shards, each holding a well-balanced chunk of the data. All is going well and you’re relaxing on the sandy beaches of the Bahamas, when all of a sudden your application explodes in popularity! New user registrations are coming in at record-setting volume, and when you check your monitoring systems you see that the hosts in the persistence layer are getting beat down and storage is running low. Oh noes!

At this point, you start working out a plan to rebalance the shards. You need new hosts to come online, which means starting replication onto a new host to take part of the data, updating your hashing code, and carefully doing the switcheroo.

This is painful, nerve-wracking, and error-prone. Never fear, a better solution exists - presharding!

Bird’s Eye View

The key to presharding is the distinction between logical and physical nodes. Physical nodes are instances of your persistence application (database instances, Redis instances, etc.). Logical nodes are what data is sharded to, and these in turn map to physical nodes. You start your persistence layer with some (large) fixed number of logical nodes, say 256, and make your bucketing algorithm distribute keys among those logical nodes. When a value is needed, you determine which logical node the data resides in, look up which physical node is currently hosting that logical node, and go get the data! When the persistence hosts start filling up, adding new hosts is as easy as re-allocating some of the logical nodes and replicating the necessary data to the new hosts! Even better, if your hashing algorithm accidentally sucks, you can partition your logical nodes non-uniformly (10 here, 2 there, 18 over there).
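A minimal sketch of the idea, in Python. The hostnames and the in-memory mapping dict are hypothetical stand-ins - in a real deployment the logical-to-physical mapping would live in a metadata store, as described below. The point is that the bucketing function is fixed forever; only the mapping changes when you rebalance:

```python
import hashlib

NUM_LOGICAL_NODES = 256  # fixed at cluster creation; deliberately large

# Hypothetical mapping of logical node -> physical (host, port).
# Initially split evenly across two physical hosts.
logical_to_physical = {
    node: ("redis-a.example.com", 6379) if node < 128
    else ("redis-b.example.com", 6379)
    for node in range(NUM_LOGICAL_NODES)
}

def logical_node_for(key: str) -> int:
    """Bucket a key into one of the fixed logical nodes."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % NUM_LOGICAL_NODES

def physical_node_for(key: str) -> tuple:
    """Resolve a key to the physical host currently serving it."""
    return logical_to_physical[logical_node_for(key)]

# Rebalancing onto a new host: reassign some of redis-b's logical
# nodes to redis-c (after replicating their data). The bucketing
# function above never changes, so clients need no code changes.
for node in range(192, 256):
    logical_to_physical[node] = ("redis-c.example.com", 6379)
```

Note that the non-uniform partitioning mentioned above falls out for free: nothing requires each physical host to serve the same number of logical nodes.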

Nuts and Bolts

A couple of pieces need to be added to your application cluster to support presharding. In this post, I’ll describe a presharding system I’ve built that was pretty easy to maintain and scale.

Persistence layer

For this example, I used Redis as the persistence layer. The Redis server is single-threaded, so you can run one instance per core on a single host without the instances degrading each other’s performance. Each server instance binds to a distinct port.

Metadata host

In order to map logical nodes to physical nodes, something needs to maintain and serve up that mapping. Apache ZooKeeper is a good piece of technology for maintaining cluster-wide state such as this.

You’ll basically want ZK nodes keyed by logical node identifier, each storing the host and port of the physical node currently holding that logical node. Note that multiple logical nodes can live on a single Redis instance if need be (though this can make maintenance a little more painful).
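One possible znode layout, sketched below. The path prefix and the JSON payload format are my own assumptions for illustration, not a standard - the only requirement is that a client can go from a logical node identifier to a host and port:

```python
import json

# Assumed layout: one znode per logical node under a common prefix,
# e.g. /preshard/logical/042 -> {"host": "...", "port": ...}
ZK_PREFIX = "/preshard/logical"

def znode_path(logical_node: int) -> str:
    """Path of the znode tracking one logical node."""
    return f"{ZK_PREFIX}/{logical_node:03d}"

def encode_location(host: str, port: int) -> bytes:
    """Serialize a physical-node location for storage as znode data."""
    return json.dumps({"host": host, "port": port}).encode("utf-8")

def decode_location(data: bytes) -> tuple:
    """Parse znode data back into a (host, port) pair."""
    loc = json.loads(data.decode("utf-8"))
    return loc["host"], loc["port"]
```

Rebalancing a logical node then amounts to replicating its data to the new host and writing the new location into that one znode.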

Client-side awareness of the sharding

The parts of your application that access the persistence layer need to know that, once they’ve converted a key into a logical node identifier, they should check with ZooKeeper to determine where to connect for the data. This is pretty easy to roll your own if nothing out there fits your needs specifically.
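A rough shape for such a roll-your-own client, with the ZooKeeper query and the Redis connection factory injected as plain callables (the `ShardedClient` name and its interface are hypothetical, and `lookup`/`connect` are stand-ins for the real metadata and connection code):

```python
import hashlib

NUM_LOGICAL_NODES = 256

def logical_node_for(key: str) -> int:
    """Bucket a key into one of the fixed logical nodes."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % NUM_LOGICAL_NODES

class ShardedClient:
    """Routes each key to the physical node serving its logical node.

    `lookup`  - logical node id -> (host, port), e.g. a ZooKeeper query
    `connect` - (host, port) -> a connection, e.g. a Redis client factory
    """

    def __init__(self, lookup, connect):
        self._lookup = lookup
        self._connect = connect
        self._pool = {}  # cache one connection per physical node

    def connection_for(self, key: str):
        host_port = self._lookup(logical_node_for(key))
        if host_port not in self._pool:
            self._pool[host_port] = self._connect(host_port)
        return self._pool[host_port]
```

In practice you’d also want to invalidate the cached mapping when a logical node moves - ZooKeeper watches are a natural fit for that.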

Putting it together

With these pieces in place, you will now have everything you need to build a dynamically growable, sharded persistence layer! Handling failover and host additions is described (for Redis) here.

With the high-level ideas in hand, you can apply this same scheme to a variety of persistence solutions and cluster-state storage mechanisms - have fun!
