Data deduplication – eliminating repeated data to save storage space and speed transmission over the network – sounds good, right? ‘Data deduping’ is currently in the spotlight as a technique to help organisations boost efficiency and save money, although it’s not new. PC utilities like WinZip have been compressing files for years. The new angle is applying the idea systematically across vast swathes of data. By reducing the storage volume required, enterprises may be able to keep more data on disk or even in flash memory, rather than in tape archives. Vendor estimates suggest customers might store up to 30 terabytes of data in a physical space of just one terabyte.
How then do you set about recovering deduplicated backups as part of a disaster recovery action? This is where data deduping needs a little more thought and DR planning. Whereas an individually zipped file can simply be run backwards through the zipping application, massively deduplicated data relies on a separate database of references that point to the places where duplicate data originally appeared. If you want your data back in their original form (including the instances of repetition), that database of references is a crucial component. And that’s assuming you haven’t been a victim of a further problem for disaster recovery, the ‘hash collision’.
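To make the point concrete, here is a minimal sketch (not any vendor’s actual implementation) of chunk-level deduplication. The `refs` list plays the role of the reference database: the store of unique chunks alone cannot rebuild the original data without it.

```python
import hashlib

CHUNK_SIZE = 4  # artificially tiny chunks so the example shows repetition

def dedupe(data: bytes):
    store = {}   # hash -> unique chunk, stored only once
    refs = []    # ordered list of hashes: the crucial reference database
    for i in range(0, len(data), CHUNK_SIZE):
        chunk = data[i:i + CHUNK_SIZE]
        h = hashlib.sha256(chunk).hexdigest()
        store.setdefault(h, chunk)  # a duplicate chunk is not stored again
        refs.append(h)
    return store, refs

def restore(store, refs):
    # Rebuilding the original (repetition included) needs BOTH parts:
    # the unique chunks AND the ordered references.
    return b"".join(store[h] for h in refs)

data = b"ABCDEFGHABCDXYZ!"
store, refs = dedupe(data)
assert restore(store, refs) == data  # round trip succeeds
assert len(store) == 3               # "ABCD" appears twice but is kept once
```

Lose `refs` and the backup is unrecoverable, which is why the reference database must itself be protected in any DR plan.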
Essentially, the deduplication process computes a hash number for each chunk of data it processes and uses that number as the chunk’s reference. It then compares these hash numbers to detect duplicate data: chunks with the same hash number are treated as identical. However, some hashing algorithms may occasionally generate the same hash number for two different chunks of data. In that case data will be lost, because the system considers the two chunks to be the same and never fully records the second one. So if you’re thinking of deploying data deduplication, remember to factor in both of these aspects when you update your disaster recovery planning.
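The collision risk can be demonstrated with a deliberately weakened hash. The sketch below (illustrative only) truncates SHA-256 to a single byte, so a collision is guaranteed among a modest number of chunks; a dedup system that trusts the hash alone then silently restores the wrong data. Production systems use full-length hashes such as SHA-256, where accidental collisions are astronomically unlikely, and some add a byte-for-byte verification step as a further safeguard.

```python
import hashlib

def weak_hash(chunk: bytes) -> int:
    # Only 256 possible values: collisions are inevitable (pigeonhole).
    return hashlib.sha256(chunk).digest()[0]

def find_collision():
    # Among 1000 distinct chunks and 256 hash values, two must collide.
    seen = {}
    for i in range(1000):
        chunk = b"chunk-%d" % i
        h = weak_hash(chunk)
        if h in seen and seen[h] != chunk:
            return seen[h], chunk
        seen[h] = chunk

a, b = find_collision()
assert a != b and weak_hash(a) == weak_hash(b)

# Deduping [a, b] with the weak hash: b looks like a duplicate of a,
# so its contents are never stored.
store, refs = {}, []
for chunk in (a, b):
    h = weak_hash(chunk)
    store.setdefault(h, chunk)
    refs.append(h)

restored = b"".join(store[h] for h in refs)
assert restored == a + a  # b was silently replaced by a: data lost
assert restored != a + b  # the original can no longer be recovered
```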