Linkify

Linkify is a filesystem de-duplicator. It finds copies of the same file in two directory trees and replaces one with a hard link to the same file contents as the other, freeing up the disk space that was used by the duplicate file.

This is useful when you have multiple snapshot backups of a directory tree and want to remove unnecessary duplication. It only works on filesystems that support hard links, which includes pretty much any default filesystem under UNIX but not Windows/DOS-type filesystems.

One tree is the 'target', in which duplicate files will be replaced with hard links, and the other tree is the 'reference', against which each file will be compared and potentially linked. In practice, it makes no difference to the result which tree is specified first, because both will end up with files pointing to the same disk blocks. However, where one tree is much larger than the other, linkify will run faster if the smaller one is the target tree (specified first). When de-duplicating backups, it is typical for the newer tree to be larger (as files accumulate over time), so it is usually faster to specify the older tree first and use the newer tree as the reference tree.
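As a rough way to decide which tree to specify first, the file counts of the two trees can be compared. The sketch below is only an illustration: the helper names count_files and suggest_order are hypothetical and are not part of linkify.

```shell
# Hypothetical helper: count regular files under a directory tree.
# Filenames containing newlines would be over-counted, which is
# acceptable for a rough size comparison.
count_files() {
    find "$1" -type f | wc -l
}

# Hypothetical helper: print the two trees with the smaller one first,
# i.e. in the order "target reference" suggested by the text above.
suggest_order() {
    if [ "$(count_files "$1")" -le "$(count_files "$2")" ]; then
        printf '%s %s\n' "$1" "$2"
    else
        printf '%s %s\n' "$2" "$1"
    fi
}
```

Calling suggest_order with two backup directories prints them in the order in which they could be passed to linkify.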

You can run linkify multiple times on the same pair of directories. Files that have already been linkified will not be re-linkified, because the two filenames will have the same inode number and will therefore be skipped.
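The inode comparison can be sketched as follows. This is an illustration only, not linkify's own code: same_inode is a hypothetical helper, and it assumes GNU stat (the -c format option).

```shell
# Hypothetical helper: succeeds (exit status 0) if both paths refer to
# the same inode on the same device, i.e. they are already hard links
# to the same file.
# Assumes GNU stat: %d is the device number, %i the inode number.
same_inode() {
    [ "$(stat -c '%d:%i' -- "$1")" = "$(stat -c '%d:%i' -- "$2")" ]
}
```

A pair of files for which same_inode succeeds has nothing left to de-duplicate.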

Due to the way hard links work on POSIX/UNIX systems, both directories must be on the same filesystem. Timestamps are preserved by setting the timestamp of the new "link" filename to that of the original (not the reference) file. The new link entry should look exactly the same as the original one.
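One way to check the same-filesystem requirement before running linkify is to compare the device numbers of the two trees. The helper below is a hypothetical sketch that assumes GNU stat; it is not taken from the script.

```shell
# Hypothetical helper: succeeds if both paths live on the same
# filesystem, judged by comparing device numbers.
# Assumes GNU stat: %d prints the device number.
same_filesystem() {
    [ "$(stat -c '%d' -- "$1")" = "$(stat -c '%d' -- "$2")" ]
}
```

If same_filesystem fails for the two tree roots, hard links between them cannot be created.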

File ownership and permission settings are similarly preserved - each linkified file retains the ownership and permissions of the original file.
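Copying attributes from one file to another might look like the sketch below. It assumes the GNU versions of chown and chmod (the --reference option is a GNU extension), and copy_attrs is a hypothetical name. Bear in mind that all hard links to a file share one inode, so attributes set through one name are seen through every other name for that file.

```shell
# Hypothetical helper: copy owner, group, permissions and modification
# time from $1 onto $2.
# Assumes GNU chown and chmod for the --reference option.
copy_attrs() {
    chown --reference="$1" -- "$2" &&
    chmod --reference="$1" -- "$2" &&
    touch -r "$1" -- "$2"
}
```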

Warnings and Disclaimers

Warning: Data de-duplication saves space (yay) but it reduces your scope for data recovery (boo). If you have several distinct (not de-duplicated) backups, and you get a disk failure in the data blocks of one backup, you can retrieve the damaged file from another backup. But after de-duplication, both files are pointers to the same data blocks, and any corruption will affect both backups. It's recommended that you keep multiple discrete backups and de-duplicate only some of them. De-duplication is simply a tool to help with your backup strategy, but the creation and implementation of that strategy is your responsibility.

Disclaimer: I'm not responsible for your loss of data under any circumstances, including but not limited to use or mis-use of this software. You use this software entirely at your own risk. This software changes data on your drive, and using it might cause you to lose data.

Copyright

Linkify is copyrighted and licensed under the GNU General Public License.

Download

Download/view it here (v0.2) or review the changelog.

Installation and Usage

To install and use:
  1. Copy the script into an appropriate place, such as /usr/local/sbin.
  2. Run the script with suitable parameters.

Usage: linkify [ -v | -d ] target-tree reference-tree

The "-v" and "-d" options enable verbose and debugging modes, respectively. Adding debug options produces more detailed debugging information.

Example Usage

Here's an example:

 linkify -v /var/spool/backups/20100104 /var/spool/backups/20100105

This example shows verbose output, so every file that is linkified will be displayed. Every filename in the first (target) tree will be searched for in the same place in the second (reference) tree to see if a file there has the same name, is not already a hard link to the same file (same inode number) and has the same contents (using an md5 checksum).
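The matching rules described above can be sketched like this. It is a minimal illustration under stated assumptions, not the script itself: is_duplicate and list_duplicates are hypothetical names, GNU stat and md5sum are assumed, and filenames containing newlines are not handled.

```shell
# Hypothetical helper: succeeds if target $1 could be replaced by a
# hard link to reference $2.
is_duplicate() {
    # Both names must exist as regular files (symlinks are skipped).
    [ -f "$1" ] && [ ! -L "$1" ] || return 1
    [ -f "$2" ] && [ ! -L "$2" ] || return 1
    # Skip pairs that already share an inode (already linked).
    [ "$(stat -c '%d:%i' -- "$1")" != "$(stat -c '%d:%i' -- "$2")" ] || return 1
    # Contents must match, judged by md5 checksum.
    [ "$(md5sum < "$1")" = "$(md5sum < "$2")" ]
}

# Hypothetical helper: print the relative path of every file in target
# tree $1 that has a linkable duplicate at the same place in
# reference tree $2.
list_duplicates() {
    ( cd "$1" && find . -type f ) | while IFS= read -r rel; do
        if is_duplicate "$1/$rel" "$2/$rel"; then
            printf '%s\n' "${rel#./}"
        fi
    done
}
```

Running list_duplicates on the two backup directories from the example would list the candidates without changing anything on disk.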

The "linkification" process consists of the following steps for each file:

  1. rename the target file to a temporary filename .linkify.original-name.PID
  2. create a hard link from the original target filename to the reference file
  3. change the timestamp on the new filename to match the temporary filename
  4. remove the temporary filename, thus freeing up the duplicate data blocks
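The four steps above can be sketched as a single shell function. This is a hedged illustration rather than the script's actual code: linkify_file is a hypothetical name, the rollback when the link cannot be created is an added assumption, and ownership and permissions are left out for brevity.

```shell
# Hypothetical sketch: replace target $1 with a hard link to reference $2.
linkify_file() {
    # Temporary name in the target's directory: .linkify.original-name.PID
    saved="$(dirname -- "$1")/.linkify.$(basename -- "$1").$$"

    # Step 1: rename the target to the temporary filename.
    mv -- "$1" "$saved" || return 1

    # Step 2: create the hard link under the original target filename.
    if ! ln -- "$2" "$1"; then
        # Assumption: put the original back if the link cannot be made.
        mv -- "$saved" "$1"
        return 1
    fi

    # Step 3: copy the timestamp from the temporary file to the new link.
    # The inode is shared, so the reference name shows this timestamp too.
    # Step 4: remove the temporary file, freeing the duplicate blocks.
    touch -r "$saved" -- "$1" && rm -f -- "$saved"
}
```

If the function is interrupted between steps, the leftover temporary file keeps the original contents.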

Possible Problems

It's possible for linkify to fail during the above 4-step process. In most cases there will be no ill-effects. In the worst case, the original file will have been renamed to the temporary filename .linkify.original-name.PID and the new filename won't yet have been created. In some cases, the new file will have been created but the temporary filename won't have been cleaned up.

If you think there might have been a problem, simply search the target tree for filenames matching the format used, for example:

 find /var/spool/backups/20100104 -name .linkify.\* -ls

Then rename each file found back to its original name (remove the .linkify. prefix and the .PID suffix) and you should be back to how things were before.
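Working out the original name from a leftover temporary filename can be sketched as below. The helper names original_name and list_leftovers are hypothetical, and the sketch only prints the proposed renames so that they can be reviewed before anything is moved.

```shell
# Hypothetical helper: map dir/.linkify.original-name.PID back to
# dir/original-name.
original_name() {
    base=$(basename -- "$1")
    base=${base#.linkify.}   # drop the ".linkify." prefix
    base=${base%.*}          # drop the trailing ".PID"
    printf '%s/%s\n' "$(dirname -- "$1")" "$base"
}

# Hypothetical helper: list leftover temporary files under tree $1,
# each with the name it would be renamed to. Nothing is changed.
# Filenames containing newlines are not handled.
list_leftovers() {
    find "$1" -type f -name '.linkify.*' | while IFS= read -r leftover; do
        printf '%s -> %s\n' "$leftover" "$(original_name "$leftover")"
    done
}
```

Each line of output names one leftover file and the filename to restore it to with mv.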

Platform

Linkify was developed, tested and used under Linux. It should run on other POSIX systems with little alteration. However, it relies on GNU touch, chown and chmod to copy the timestamp, ownership and permissions from a reference file. Standard POSIX find is used to traverse the target tree.

Feedback

Comments and submissions welcome: jeremy+linkify@laidman.org