Erasure Codes and Archiving

Thinking about 100% durability

Recently I've been thinking about what is really required for fail-safe, 100% data durability. Most suggestions involve some large amount of duplicates for each piece of data but the cost of that can be significant. Although it is very possible that the idea of "just more duplicates" is correct, it never hurts to question those kinds of assumptions and explore other options. Anything that may be able to reduce the cost of ownership for a reliable data archive is worth considering.

One alternative that I discovered (or rather, rediscovered) while exploring other options is a technique for repairing data corruption: erasure codes. Put simply, erasure codes are a class of techniques for correcting errors in data; they belong to a larger class of such techniques called "forward error correction." These codes are used all over the world (and in space!) but don't be surprised if you haven't heard of them. They are one of the many unsung heroes of the modern technological world.

Although these techniques were originally invented as a means of dealing with lossy data channels in communications, they have since been extended into data storage technologies as well. Many low-level storage system like RAID and CDs/DVDs use some type of error correction codes. Storage and communications may seem like totally unrelated technologies but the problems that occur when transmitting data from one location to another are effectively the same as those with storing data; it is just a difference of which dimensions the data is being transmitted through. Communications deals with transmission through space and storage deals with transmission through time!

So erasure codes work well for increasing data integrity in low-level storage systems. They make it possible to recover data despite corruption without the need for an entire extra copy of that data. Maybe they could work well for a data archive as well.

The idea seems straightforward:

For an archive to be successful, it needs to maintain data despite corruption.
Erasure codes help correct data corruption.
Erasure codes can be used to correct corruption in an archive after it has occurred.
Therefore, archives should use erasure codes.

It's a simple idea and it can be easy to belive the conclusion is justified. Each of the first 3 statements are correct but the conclusion doesn't follow. The mistake here is one that is easy to make and, if you look, you can see people making it all the time. So what's the mistake?

Just because something can solve a problem doesn't mean that is the way that the problem should be solved. Or, put another way, just because something works doesn't mean that's what we should do. It's easy to stop looking for solutions when we find something that works but, by doing that, we have the tendancy to overlook the actual trade-offs such a solution makes.

I spent a while considering what erasure codes could do and how they could be a valuable part of an archive. The part that is missing from the idea above is what the actual cost is for using erasure codes and what you get in return. When the cost is compared to the alternatives and the limitations are considered, it becomes clear that erasure codes aren't a viable solution for any system that aims to have 100% data reliability. Let's look at why they can work well for something like a scratched CD but not for a data archive.

What erasure codes can do

In a basic way, a file can be viewed as a series of blocks of data. You can imagine those blocks to be of (almost) any size from a single byte - so an N byte file would be N blocks - to the entire size of the file - so the file would be just 1 block. (This is a simplification but the general idea is correct) Erasure codes create a kind of redundancy that allows missing blocks to be replaced with redudant blocks. However, to replace a block of some size (any size) you have to have a redundant block of equivalent size to replace it. (Well, almost the same size - more on that shortly) If you have two missing blocks then you need two replacement blocks. Three missing, three replacement. Et cetera.

And the best part is that any replacement block works for any missing block no matter where that missing block is located in the original file. There is a certain magical feeling to the way it works. Yay math!

Each redundant block can only be used once (and no, you can't trick it by running the replacement multiple times with the same block) but it's possible to create any number of those blocks from the original data. With modern erasure code algorithms it's effectively possible to design the algorithm so that you can replace any amount of missing data with the redundant data. You can define both the block size and the number of redundant blocks you want to create.

Sort of.

The limits

I say erasure codes work with "blocks" and not bits or bytes for a reason. I'm not trying to be obtuse or fancy using a different word. The math used in erasure codes works with matrices; any software that implements that math uses bytes distributed across the blocks (contiguous groups) of the original data to create those matrices. The redundant blocks that are generated "fit" with the original data to form a matrix which can have a certain number of values missing from the original data but still be solved. There is no way for the math to work without using the entire blocks.

Additionally, once the blocks are calculated, there is no way to decrease or increase their size without erasing the previously calculated blocks and generating entirely new ones. There is no way to decide you want to calculate extra blocks without first having a complete copy of the original data. If you need more repair blocks than you originally calculated (e.g. you calculated 2% redundancy but you've lost 2.1%) then you have lost your data. You can't use the redundant blocks you have to reduce the damage to the file; you can either repair the file or you can't.

What this means for data recovery is that you cannot think of being able to recover X% of a file's bytes. Instead, you have to understand that you are talking about recovering X% of the file's blocks. The best case scenario is that all of the lost data is concentrated into the minimum number of blocks; anything else results in fewer bytes recovered.

The reason this is true may not be immediately clear but a simple example may help.

The limits - an example

Say we have a 10,000 byte file that we split into 100 blocks of 100 bytes. We want to protect 10% of the original file's bytes (1000 bytes) - what number of blocks do we have to make redundant in order to guarantee this?

Let's start by assuming block and byte redundancy is the same and calculate 10% redundancy in blocks; that means we generate 10 blocks. 1000 bytes of 10,000 bytes is 10% and 10 blocks of 100 blocks is 10% so we are good right?

Think about the two extremes of how we can corrupt bytes in a file: 1. The lost/missing/damaged data is all collected together into the minimum number of blocks. In our example, that means we have all 1000 bytes fully grouped into 10 blocks (remember each block is 100 bytes). 2. The lost/missing/damaged data is uniformly spread across the entire file. In our example, that means we are missing 1 byte out of every 10 bytes. Since each block is 100 bytes, we have 10 byte of damage in each block.

In the first case, the 1000 bytes of damage is collected entirely in 10 blocks but we have 10 redundant blocks and so we have no problem recovery the 10% of bytes. However, in the second case we have 100 damaged blocks but only 10 recovery blocks! We have nowhere near enough recovery data to fix the file and it is lost. In this example, with the type of damage described in the second case, we only have enough blocks to cover 1% of the lost bytes: 10 byte per block for 10 blocks is 100 bytes and 100 bytes out of 10,000 bytes is 1%. Erasure codes don't actually work to recover these bytes but, if they did, that'd be the maximum number of repairable bytes.

But is 1% the actual worst case for damage that is distributed evenly? No, it's not even close to the worst. A single bad byte corrupts a block. We have 10 redundant blocks. That means that we just need 10 damaged bytes in the correct locations - one per block - and we have hit our repair limit. That means 10 bytes out of a 10,000 byte file is the maximum we can repair in the worst case scenario. That is 0.1% of the original data. This worst case depends on the size of the blocks used in the erasure code. One bad byte per block corrupts the whole block and so the bigger the blocks the fewer bytes that are needed to corrupt more blocks than we can repair.

So, the scenario in our example was to be able to recover the file with 10% of the bytes damaged. We see now that 10% recoverable blocks won't do it with the block size we used. In fact, we have proven that we need one recovery block per byte in the worst case. If we want to recover 1000 bytes that would mean 1000 blocks. But the original file is only 100 blocks! What does that mean about our example?

For 100 byte blocks, the only way to guarantee we can recover 10% of the bytes is to have a completely redundant copy of the data (100 recovery blocks for 100 blocks of original data). It is possible to reduce that by shrinking the block size but usually only in theory. The minimum block size is theoretically 1 byte but many implementations of erasure codes have higher limits than that. For example, Parchive 2.0 has a minimum block size of 4 bytes. At 4 bytes per block, the original file would be 2500 blocks and we would still need 1000 redundant blocks to recover 1000 bytes. So using the smallest possible block size, 4 bytes, requires 40% block redundancy (1000/2500) for a guaranteed ability to recover 10% of the data.

In fact, the only time that the required percentage of recovery blocks matches the guaranteed byte recovery percentage is when the blocks are 1 byte in size. The bigger the blocks, the less damaged data we can guarantee being able to repair.

The cost (space, not money)

Erasure codes are not actually magic. They work by doing calculations over the original data and generating new data so that the combination of new data plus partial original data allows for the damaged values to be recalculated. You can choose (almost) any block size and any number of redundant blocks to create but those choices make a difference in the amount of extra data needed in the redundant blocks. By "extra" data, I mean data beyond the size of data to be replaced. For example, if you have an original data block size of 50 bytes but, due to the way the math works, the redundant blocks must be 52 bytes then you have 2 "extra" bytes per block. Let's call this extra data "overhead" and can consider it a storage space cost beyond the original data.

Theoretically, it is possible to generate redundant blocks with no overhead. In this case, if you have blocks of size N in the original data then the redundant blocks are of size N as well. It is, in fact, possible to reach this theoretical performance in a real system (in fact I coded such an algorithm for a communication system many years ago) but it requires controlling the data size you start with and heavily restricts the data block size. If you don't have control over the input data size or you want to choose specific (read: "most") block sizes then you have to accept having overhead.

That isn't the only kind of overhead with using erasure codes with arbitrary files. There is another kind that isn't due to the math behind how they work but is just a common artifact generated by lots of applications: bookkeeping. In order to be able to work with different size blocks and different numbers of blocks, parameters of the algorithm have to be variable as well; the parameters that work for one set of data (one or more files) don't work for the next. An implementation of an erasure code algorithm requires some means of recording what parameters were used or it wouldn't be able to understand how to use the redundant blocks to repair the file later. There can be other reasons for bookkeeping overhead but it fits into the same category so I won't go into more detail. The total amount of bookkeeping will depend on the implementation but it will always be greater than zero bytes.

These two types of overhead can be summarized as "fixed" - which is a result of the implementation of the algorithm - and "block-dependent" - which gets larger the more blocks we use. The fixed overhead puts an effective lower bound on the size of file that erasure codes are useful for and the block-dependent overhead puts an effective upper bound on the number of redundant blocks we generate. The exact bounds depend on the implementation of the codes but they will always exist somewhere.

Surpassing these bounds means that our redundant data is of a greater size than the entire original file but can only repair damage to some of that file. In that case, it costs less (takes up less storage space) to keep an entire redundant copy of the original file than to use erasure codes.

Erasure codes are good for certain problems

So, with these kinds of limits, why are erasure codes used for anything? Think back to the example we looked at and remember what kind of pattern was required for erasure codes to be effective: the damaged/missing data has to be in contiguous chunks. That is exactly the sort of damage caused by most telecommunications problems (the communication channel cuts in and out) and a very likely type of damage to things like CDs (scratches are usually straight lines, not tiny flecks all over). They are great for communications and improving reliability.

In some situations, with the right design, you can turn control the way damage/loss affects the data. You can shape your expected types of failures into whole blocks being damaged and limit how distributed it is. This is exactly how RAID 5/6 works. It splits data into fixed-size pieces (blocks) and distributes them, along with the redundant blocks, evenly across multiple drives. The loss of a drive then becomes the loss of a fixed number of entire blocks. This is exactly the kind of damaged that erasure codes are most effective at repairing and it is what RAID 5/6 protects against. RAID 5/6 doesn't help with data corruption that slowly modifies a few bits here and there on different disks (or the loss of too many disks at once).

So as long as you have enough control of your data to keep it within the the limits and costs defined by your particular implementation of erasure codes then they are a valuable solution to the problem of data corruption. Some types of problems are able to provide that kind of control.

But erasure codes aren't good for archiving

Designing a system to have 100% durability (never losing data) isn't a situation that erasure codes can fix by themselves. There are too many failure modes that will cause the number of available redundant blocks to be exceeded. That is why CDs can become unplayable. It is why even a RAID 5/6 array can unrecoverably lose data. Erasure codes are means of restricting data loss but not eliminating it entirely.

Ultimately, a 100% durable system needs the ability to reliably replicate an entire file; repairing some part of a file is useful but not sufficient. Internal to such a system, it may be necessary to repair part of a file; corruption will definitely occur over time. But when you have entire copies of the original data, why would you want to pay the cost that comes with erasure codes? There are simpler ways to repair the file (rsync comes to mind) and, in the worst case, you can simply replace the entire file as a means of "repair."

I may be missing something but, for now, it still seems that 100% durability isn't achievable without maintaining multiple, complete copies of each file and that erasure codes aren't worth their cost.