
ddar

ddar is a free de-duplicating archiver for Unix. Save space, bandwidth and time by storing duplicate regions of data only once. Use ddar to:

  • Back up local data to a remote server, each time saving space, upload bandwidth and time by only transferring data not already present on the remote server. This is what cloud backup services like Tarsnap allow you to do already. Now you can use your own storage and internal bandwidth to achieve the same thing.

  • Back up remote data to a local disk, downloading and storing only changed data each time (the inverse of the above).

  • Back up local data to a local external disk, each time saving time and space by only writing data that is not already present on the disk. For example, I keep six gzipped tarballs of my laptop, each around 50 GiB, on my external disk, yet together they use only 64 GiB of storage.

  • Efficiently store any data that has redundancy. ddar will exploit redundancy across different files stored at any time.

Key Features

  • Free and Open Source. Have ddar installed and available for use on every machine. No fees and no account management.

  • Like Tarsnap, ddar uses a snapshot model. Each member (e.g. each full backup) exists in its own right, with no incrementals, differentials, dependencies or deltas to worry about. Any member can be added, extracted or deleted without interfering with the others (see the sketch after this list). ddar operates in O(n) time for storage, extraction and deletion, where n is the size of just that member.

  • ddar is fast — great for sysadmins storing and moving large volumes of data. Extraction will go as fast as your I/O will allow. Stores are fast too, although they are usually CPU bound. For example: my 2.13 GHz Intel Core i3 takes about 40 seconds to de-duplicate and store 1 GiB on a LUKS-encrypted partition.

  • As well as local de-duplication, ddar will also do local-to-remote or remote-to-local de-duplication, transmitting only data not yet present at the other end to save bandwidth. ssh is used to manage transport, authentication and encryption. ssh access is only required in one direction: you decide which.

  • ddar follows the Unix philosophy by focusing on de-duplication only. You can use it in conjunction with tar and gzip, or cpio, afio, pax or anything else. It isn’t even just for backups. Got a bunch of disk images, virtual machine images or ISOs that you think are similar? ddar will store them efficiently for you.
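
A quick sketch of the snapshot model described above, using the tar-style commands that appear under Getting Started below. The member names are illustrative, and the df delete syntax is an assumption by analogy with the c/t/x commands; check the man page for exact usage:

$ tar c data | ddar cf archive -N monday    # Monday's full backup becomes its own member
$ tar c data | ddar cf archive -N tuesday   # Tuesday's member; duplicate regions are stored once
$ ddar tf archive                           # list members
$ ddar xf archive monday | tar x            # extract Monday independently of any other member
$ ddar df archive monday                    # delete Monday without touching Tuesday (assumed syntax)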

How It Works

  • The ddar program manages the storage of multiple data blobs in a container. Each blob is called a member. The container is called an archive. This is conceptually equivalent to storing multiple files in a directory or tarball.

  • Available operations on an archive are: create/append, extract and delete.

  • ddar optimises storage by de-duplicating data stored in an archive on the way in. Regions of data that are identical across members are stored only once. Apart from the amount of storage used, this behaviour is completely transparent to the user: members remain first-class, logically independent objects with no inter-dependencies.

  • During a create/append operation, ddar can split the de-duplication task into two halves which communicate over ssh. The first half is where the source data being added is; the second half is where the archive is. In this mode, ddar will not transmit data from the source that is already present in the archive, thus saving bandwidth as well as storage (see the example below).
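
For example, the split mode in the last point is what runs behind a command like this one (from Getting Started below). Roughly, the half where source_dir lives reads and chunks the data, the half holding dest_archive checks which chunks are already present, and only the missing chunks cross the wire:

$ tar c source_dir | gzip --rsyncable | ddar cf server:dest_archive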

Alternatives

  • Tarsnap is a de-duplicating cloud-based backup service. Compare ddar to Tarsnap.
  • rdiff-backup can do some de-duplication subject to limitations of the rsync algorithm. Compare ddar to rdiff-backup.
  • Like rdiff-backup, duplicity uses librsync and is subject to the same limitations, although it uses forward deltas rather than reverse deltas. duplicity additionally supports encryption using GnuPG.
  • ZFS provides generic de-duplication, but only at block level, so for example inserting a single byte at the start of a copy of a file will cause that file to be stored twice. Apart from this key difference, storing multiple members in an archive managed by ddar and storing multiple files in a directory mounted inside ZFS are roughly equivalent. However, ZFS doesn’t operate at a layer that can provide bandwidth savings for remote archiving.

Caveats

  • ddar is new software. Please use it with caution until it has had wider use. ddar --fsck will check the complete archive for integrity (see the example after this list); individual checksums are also checked automatically upon extraction.

  • If you compress data and then want it de-duplicated, use gzip --rsyncable or equivalent (see the example after this list). Otherwise a single bit change will cause the remainder of the stream to change radically, so similar data will come out completely different and can no longer be de-duplicated. ddar will still work in this case, just not efficiently, which defeats its purpose.

  • ddar itself does not encrypt anything. ssh provides encryption of remote communication. If you need encryption for storage, you must do it at the filesystem level, for example with cryptsetup, encdrive, or BitLocker. If you need to store data on a system you cannot trust, ddar alone will not suffice; try Tarsnap instead.

  • ddar relies on your storage. If you need data redundancy, you need to arrange it. If you’d like someone else to take care of providing storage and redundancy, try Tarsnap.

  • As ddar only does de-duplication, it must first read all data locally in order to de-duplicate it. Dedicated backup programs may be able to optimise this out by skipping files whose modification times have not changed. This means that some extra I/O and optionally extra CPU (for compression) is used on the source. In practice, this has not been a problem for me.
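
To illustrate the first two caveats concretely (the exact --fsck invocation shown here is an assumption; see the man page):

$ ddar --fsck archive                                      # verify the integrity of the complete archive
$ tar c source_dir | gzip --rsyncable | ddar cf archive    # rsync-friendly compression keeps the stream de-duplicable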

Licence

ddar is protected by copyright. If you wish, you may distribute it subject to the terms of the GNU GPL version 3.

Installation

I use Ubuntu 8.04, 10.04 and 10.10 in various combinations. Success and failure reports and contributions for wider system support are appreciated.

Ubuntu

8.04 Hardy Heron: install python-protobuf (all architectures) and then ddar (i386/amd64).

10.04 Lucid Lynx and 10.10 Maverick Meerkat: install ddar (i386/amd64). The installer will automatically pull in the python-protobuf dependency from the official tree.

Debian

For Debian stable and oldstable, use my Ubuntu Hardy backport packages as follows:

5.0 Lenny: install python-protobuf (all architectures) and then ddar (i386/amd64).

6.0 Squeeze: install ddar (i386/amd64). You may also need to pull in the python-protobuf dependency from the official tree.

Python sdist

You will need standard compiler tools (Debian: build-essential), setuptools (Debian: python-setuptools) and the Python development packages (Debian: python-dev) installed.

ddar depends on google.protobuf. Due to a bug in protobuf, setuptools/easy_install cannot currently install this automatically. On Debian and Ubuntu, you can install the package python-protobuf. For Ubuntu 8.04 (Hardy Heron), you can use my backport (all architectures).

Once you have protobuf installed, unpack the source dist and run python setup.py install. The man page is ddar.1; at the moment you have to install this manually if using setuptools (patches welcome!).
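
Putting the sdist steps together on Debian/Ubuntu (the tarball name is illustrative, and the man page location is assumed):

$ sudo apt-get install build-essential python-setuptools python-dev python-protobuf
$ tar xzf ddar-X.Y.tar.gz && cd ddar-X.Y            # unpack the source dist (illustrative filename)
$ sudo python setup.py install
$ sudo install -m 644 ddar.1 /usr/share/man/man1/   # man page must be installed by hand, as noted above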

Source

The source is used to create the Python sdist and Debian packages. Satisfy the Build-Depends from debian/control and then make sdist or debuild as needed. For older versions of Debian and Ubuntu you can use my backports PPA for protobuf.
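
As a sketch, once the Build-Depends are satisfied:

$ make sdist    # build the Python source distribution
$ debuild       # or build the Debian packages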

Getting Started

Back up from the local machine to a remote archive

$ tar c source_dir | gzip --rsyncable | ddar cf server:dest_archive

If a name cannot be determined (for example, when ddar reads from stdin, as here), ddar auto-generates a suitable name based on the current date. To override this behaviour, use -N name.
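
For example, to label the member explicitly (the member name here is illustrative, and the option placement assumes the usual flag handling):

$ tar c source_dir | gzip --rsyncable | ddar cf server:dest_archive -N laptop-weekly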

Back up from the local machine to a local external disk

$ tar c source_dir | gzip --rsyncable | ddar cf /mnt/external_disk/dest_archive

Back up from a remote machine to a local archive

$ ddar cf dest_archive server:\!"'tar c source_dir|gzip --rsyncable'"

The ! indicates that the remote ddar instance should shell out to the specified command, which generates the data on its stdout. The ! must be escaped to stop bash from treating it as a history expansion. The remote command is run with sh -c, so the single quotes are needed to protect the pipeline from expansion by the remote shell.

Display archive contents

$ ddar tf archive

Extract a gzipped tarball stored in the archive

$ ddar xf archive | gzip -dc | tar x

If a member is not specified, ddar will extract the most recently added member (based on insertion order). To extract a specific member, name it as a positional argument.
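
For example, to extract a member named monday (illustrative name) instead of the most recent one:

$ ddar xf archive monday | gzip -dc | tar x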

Further reading

For more information, see the man page.

Comments and Bug Reports

For now, please contact me directly (reCAPTCHA link via bit.ly). I’ll set something else up when I need to.

Future Directions

Here are some thoughts on possible future improvements. Feedback and feature requests appreciated!

  • A FUSE filesystem driver that can mount an archive. Read-only access is easy. A writeable filesystem is also possible with the current architecture, although it would be create, append and truncate only.

  • Parameterise the de-duplication algorithm to allow archive tuning based on the nature of the intended data, in order to improve efficiency. At the moment, data is split into 256 KiB chunks on average, so each change will use at least this much extra storage and bandwidth.

  • More integration with other archivers such as tar, cpio and pax in order to create chunk boundaries on file or directory boundaries to increase efficiency.

  • More integration with other archivers to achieve I/O read skipping for files whose mtimes have not changed.

Credits

ddar is sponsored by Synctus, a multi-master, conflict-free, real-time file replication system. ddar was inspired by Tarsnap, a cloud-based de-duplicating backup tool.

You’ve read this far?! In that case, would you like to follow me on Twitter?