Sharding the unshardable
Published Dec 20, 2020
While sharding seems to be the "silver bullet" that lets us beat O(N) wall-clock time by doing the work in O(N/cpus), it's not always possible. Sharding implies that the computation is completely independent of ordering: any shard can be processed without knowing what happened in any other.
For example, counting the number of lines in an editor's workfile is completely shardable — each shard counts the number of lines in its piece, and then, at the end, all the individual counts are summed.
General searching is another great example — it doesn't matter what order the elements are found in, all you want is to find all the places where a given pattern occurs. So, you shard the search space over the available processors, and, when they've all finished, you've found all the things you're looking for.
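As a minimal sketch, here's what sharded line counting can look like in Go, assuming the file has already been read into a byte slice (the language and the one-shard-per-CPU split are my illustrative choices, not the editor's actual code):

```go
package main

import (
	"bytes"
	"fmt"
	"runtime"
	"sync"
)

// countLines splits data into one shard per CPU; each shard counts
// its own newline terminators, and the per-shard counts are summed.
func countLines(data []byte) int {
	shards := runtime.NumCPU()
	chunk := (len(data) + shards - 1) / shards // ceil division

	counts := make([]int, shards)
	var wg sync.WaitGroup
	for i := 0; i < shards; i++ {
		lo := i * chunk
		if lo >= len(data) {
			break
		}
		hi := lo + chunk
		if hi > len(data) {
			hi = len(data)
		}
		wg.Add(1)
		go func(i, lo, hi int) {
			defer wg.Done()
			// No shard depends on any other shard's result.
			counts[i] = bytes.Count(data[lo:hi], []byte{'\n'})
		}(i, lo, hi)
	}
	wg.Wait()

	total := 0
	for _, c := range counts {
		total += c
	}
	return total
}

func main() {
	fmt.Println(countLines([]byte("one\ntwo\nthree\n"))) // prints 3
}
```

Each shard touches only its own slice and its own counter, so there's no locking and no ordering; that's exactly what makes the operation shardable.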
Conversely, hashing is generally viewed as an unshardable operation — you need to start at the first byte and hash all the way, sequentially, through the last byte in order to come up with "the" hash.
Syntax highlighting in an editor is also generally viewed as unshardable. You need to start at the beginning, maintain a state machine (are we in a comment? are we in a string? are we in an #ifdef?) all the way through the last byte, in order to come up with a coherent view of the source file.
These "sharding don'ts" share an implicit Waldo Emerson-style "foolish adherence" to sequencing.
Let's examine them in turn.
Sharding Hashing
Coming up with "the" hash is not shardable, because it's sequential. But coming up with "a" hash certainly is! For most practical purposes, it really doesn't matter if the hash is computed as a "hash of hashes" over a number of component hashes, or whether it's computed as "the" hash over the entire dataset. All that's generally relevant is an answer to the question, "is this document the same as another document?"
Thus, one can simply break a document up into shards, and hash the shards. This can be done in O(N/cpus) time. Then, once all threads have computed their hashes, a final "hash of hashes" is computed over them. This "hash of hashes" is as secure as "the" hash, and has as much meaning: it's extremely unlikely that two different documents will end up with the same sharded hash.
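Here's a minimal sketch of the idea in Go, assuming SHA-256 as the underlying hash and an arbitrary shard count (both illustrative choices):

```go
package main

import (
	"crypto/sha256"
	"fmt"
	"sync"
)

// shardedHash splits data into `shards` roughly equal pieces, hashes
// each piece on its own goroutine, then hashes the concatenated
// sub-hashes into a single "hash of hashes".
func shardedHash(data []byte, shards int) [sha256.Size]byte {
	subHashes := make([][sha256.Size]byte, shards)
	chunk := (len(data) + shards - 1) / shards // ceil division

	var wg sync.WaitGroup
	for i := 0; i < shards; i++ {
		lo, hi := i*chunk, (i+1)*chunk
		if lo > len(data) {
			lo = len(data)
		}
		if hi > len(data) {
			hi = len(data)
		}
		wg.Add(1)
		go func(i, lo, hi int) {
			defer wg.Done()
			subHashes[i] = sha256.Sum256(data[lo:hi]) // each shard hashed independently
		}(i, lo, hi)
	}
	wg.Wait()

	// The final pass only touches shards * 32 bytes, which is noise
	// compared to hashing the data itself.
	final := sha256.New()
	for _, h := range subHashes {
		final.Write(h[:])
	}
	var out [sha256.Size]byte
	copy(out[:], final.Sum(nil))
	return out
}

func main() {
	fmt.Printf("%x\n", shardedHash([]byte("hello, sharded world"), 4))
}
```

The final hash over the sub-hashes costs almost nothing compared to hashing the data itself, so the O(N/cpus) term dominates.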
This is standard crypto fare (hash trees have been doing exactly this for decades), so don't celebrate just yet.
Sharding Syntax Highlighting
In my Shardable Editor article, I discuss various ways of making editors fast. The last hurdle to clear was syntax highlighting.
First, some benchmarks. When I open a 900MB, 20Mline file, my editor takes some 4.7 elapsed seconds to do so (assuming the file is in cache). Digging deeper, 750 ms of that is occupied by legacy non-sharded hashing (I hash each file that I open in order to see if I can retrieve editing history for it), and a whopping 3.8 seconds are spent doing syntax highlighting (and that's with a really efficient, super-simple state machine!).
The other 300ms are spent actually loading the file. Here's the first takeaway from that:
If you are single-threaded for any significant period of time, you're not making effective use of your resources.
I've already optimized the file read (by sharding the terminator counting and line pointer stuffing) — that's why it only takes 300ms. It's downright embarrassing that the other operations take over 10X as long as the real "work" of loading the file!
But you can't shard syntax highlighting, right? I mean, I could see an open comment block ("/*") in one shard, and then go ahead and highlight subsequent text in the next shard, completely unaware of the open comment block (which implies that I shouldn't highlight syntax because I'm in a comment)!
Ummm, yeah ... "but."
The "but" is statistics.
Let's assume that we naively sharded our source code based on lines (so, for example, in a 20Mline file with 20 threads, each thread gets 1M lines to do).
You ever see a 1M line comment?
Me neither.
So the practical solution here is to go ahead and shard the file and perform syntax highlighting. Pay no attention to what may have come before — just highlight starting at your shard start point.
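Here's a sketch of that blind first pass, with a toy state machine standing in for the real highlighter (it tracks only /* ... */ comments; the State type, highlightLine helper, and line-based sharding are all illustrative stand-ins):

```go
package highlight

import (
	"runtime"
	"sync"
)

// State is a stand-in for the highlighter's state machine.
type State int

const (
	Normal State = iota
	InComment
)

// highlightLine is a toy highlighter: it tracks only /* ... */
// comments and returns the state the line ends in. A real editor
// would also emit highlight spans here.
func highlightLine(line []byte, in State) State {
	st := in
	for i := 0; i+1 < len(line); i++ {
		switch {
		case st == Normal && line[i] == '/' && line[i+1] == '*':
			st = InComment
			i++
		case st == InComment && line[i] == '*' && line[i+1] == '/':
			st = Normal
			i++
		}
	}
	return st
}

// highlightAll naively shards by lines: each goroutine highlights its
// range as if it started in Normal state, recording the end-of-line
// state for every line so the edges can be checked afterward.
func highlightAll(lines [][]byte) (endState []State, shardStarts []int) {
	endState = make([]State, len(lines))
	shards := runtime.NumCPU()
	per := (len(lines) + shards - 1) / shards // lines per shard, rounded up

	var wg sync.WaitGroup
	for lo := 0; lo < len(lines); lo += per {
		hi := lo + per
		if hi > len(lines) {
			hi = len(lines)
		}
		shardStarts = append(shardStarts, lo)
		wg.Add(1)
		go func(lo, hi int) {
			defer wg.Done()
			st := Normal // blind assumption: ignore what came before
			for i := lo; i < hi; i++ {
				st = highlightLine(lines[i], st)
				endState[i] = st
			}
		}(lo, hi)
	}
	wg.Wait()
	return endState, shardStarts
}
```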
The trick is: when everyone is done, look at the edges.
Are there any inconsistencies? For example, does the highlight tagging state that you're maintaining for each line suggest that the last line of a shard is "in a multiline comment" and the first line of the next shard is "in regular text"? If so, smooth over the difference — start with the "in a multiline comment" state, and re-highlight the next shard, sequentially line-by-line, until you encounter a line that agrees with the state you've now reached (with the information from the previous shard).
Statistics says that it will be a few lines, or a few tens of lines, or a few hundred lines. Tens of thousands of lines? Most likely not. Millions? Never seen it.
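Continuing the sketch above, the smoothing pass might look like this (again using the toy highlightLine and the recorded endState array):

```go
// fixupEdges walks the shard edges left to right. Wherever the real
// incoming state disagrees with the blind "Normal" assumption, it
// re-highlights line by line until the recorded states agree again.
func fixupEdges(lines [][]byte, endState []State, shardStarts []int) {
	for k := 1; k < len(shardStarts); k++ { // shard 0 has no predecessor
		start := shardStarts[k]
		st := endState[start-1] // the true state crossing this edge
		if st == Normal {
			continue // the blind assumption happened to be right
		}
		for i := start; i < len(lines); i++ {
			newSt := highlightLine(lines[i], st)
			if newSt == endState[i] {
				break // states agree again; the rest of the shard stands
			}
			endState[i] = newSt
			st = newSt
		}
	}
}
```

Calling highlightAll and then fixupEdges yields the same per-line states as a single sequential pass would, but the sequential part only ever touches the few lines near each edge that actually disagree.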
I call this the "drywall trick" — when you join two pieces of drywall together, you kind of smooth over the joint with tape and mud. The rest of the drywall surface (in fact, the vast majority of it) is a smooth plane that doesn't need any further work.
Conclusion
Just because something doesn't "look" shardable doesn't mean that it can't be sharded.
Frequently Anticipated Questions
- What's the best choice for sharding? Should I go by number of CPUs or some fixed size "work item" quanta? As with most things, the answer is, "it depends." Some examples follow. In the hashing case, it really doesn't matter from the point of view of the result — since the size of the object being hashed is an implicit factor in the resultant hash (two objects of different sizes almost certainly won't have the same hash), whether you compute 24 equal-sized hashes and then a hash-of-hashes, or whether you chunk the file into 64k blocks and hash the blocks, really won't affect the end goal of coming up with a unique "signature" of the file. Where it will make a difference is if you run the editor on one box with 24 CPUs and then expect to compare hashes when running it on another box with 32 CPUs — since you've hashed / sharded based on the number of CPUs, your resultant hash isn't going to be portable unless the CPU count is included in the metadata. In that case, going by fixed chunk size is way more portable (see the sketch after this list). In the highlighting case, it doesn't matter as much, because the final product is the same regardless of whether you went by number of CPUs or by chunk size. I haven't (yet) done the research on the efficiency of sharding by CPU vs chunk on this one, but will be doing so soon. My initial implementation will probably just shard over the number of CPUs.
- So how much faster is a sharded hash? I'm glad you asked — I just did that this morning. I went with a 64k-line "block size" for the hash, which on the benchmark file I use translates to around 3 MB of data per block. Each block is done in anywhere between 5 ms and 12 ms or so, depending on the actual data size and CPU loading. The total result is that the hash went from 750 ms to a mere 115 ms — about 6.5X faster! [Why not 24X faster, since I have 24 threads available? I'm guessing memory bandwidth: 900MB in 115 ms is about 8GB/s, and the DDR4 in my machine probably peaks around 24GB/s, so I'm running at a third of peak memory bandwidth.]
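As promised in the first answer above, here's a sketch of the fixed-chunk variant of the sharded hash; the 64 KiB byte chunk is my illustrative stand-in for the editor's 64k-line blocks. The chunking (and therefore the final hash) stays the same no matter how many CPUs the machine has; only the worker count adapts:

```go
package main

import (
	"crypto/sha256"
	"fmt"
	"runtime"
	"sync"
)

// chunkSize is fixed, so the resulting hash is identical on a 24-CPU
// box and a 32-CPU box; only the degree of parallelism differs.
const chunkSize = 64 * 1024

func portableHash(data []byte) [sha256.Size]byte {
	nChunks := (len(data) + chunkSize - 1) / chunkSize
	if nChunks == 0 {
		nChunks = 1 // hash the empty input as a single empty chunk
	}
	subHashes := make([][sha256.Size]byte, nChunks)

	// Worker pool: chunk indices go down a channel, and however many
	// CPUs this machine has pull work from it.
	jobs := make(chan int)
	var wg sync.WaitGroup
	for w := 0; w < runtime.NumCPU(); w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := range jobs {
				lo := i * chunkSize
				hi := lo + chunkSize
				if hi > len(data) {
					hi = len(data)
				}
				subHashes[i] = sha256.Sum256(data[lo:hi])
			}
		}()
	}
	for i := 0; i < nChunks; i++ {
		jobs <- i
	}
	close(jobs)
	wg.Wait()

	// Same "hash of hashes" finish as before.
	final := sha256.New()
	for _, h := range subHashes {
		final.Write(h[:])
	}
	var out [sha256.Size]byte
	copy(out[:], final.Sum(nil))
	return out
}

func main() {
	fmt.Printf("%x\n", portableHash(make([]byte, 1<<20)))
}
```

The worker pool decouples the number of chunks from the number of CPUs, which is exactly what makes the result portable across machines.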