Over at Ars Technica someone asked how to convert duplicate files into symlinks. (For those of you not familiar with unix, symlinks are like aliases but work with unix commands as well as most OS X applications.) Originally I was going to do this using fdupes (installed via MacPorts). However, it is surprisingly slow. I came upon a Python program by Justin Azoff called dupes.py which runs much faster. (He discusses his methodology at his blog.) Further, because it is a Python class, it is easy to call from a Python program.
The following assumes you have dupes.py installed (either in the same directory or on the Python path).
#!/usr/bin/env python
import dupes
import os

def make_ln(file_list):
    main_file = file_list[0]   # make the first file the "master" file
    file_list = file_list[1:]  # all the files we'll replace with links
    for del_f in file_list:
        os.system("ln -sf '" + main_file + "' '" + del_f + "'")

def convert(paths):
    # paths is a list of paths to convert duplicates to links
    d = dupes.dupfinder()
    d.add_dirs(paths)
    for dups in d.find_dups():
        make_ln(dups)

if __name__ == '__main__':
    import sys
    convert(sys.argv[1:])
    sys.exit(0)
Basically the dupes Python module returns a list of lists of identical files. That is, each element of the outer list is itself a list of files with identical contents. So we pass our make_ln function a list of identical files. In that function we simply call the ln command, which creates a link. The -f option forces a replacement, effectively deleting the duplicate file. The -s option creates a symbolic link. When I use the above I actually prefer to use hard links. Those only work within a single volume, so there are reasons not to use them.
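I won't reproduce Justin's code here, but the idea behind a duplicate finder like dupes.py can be sketched in a few lines: group files by size first (cheap), then hash only files that share a size, and yield each group of identical files as a list. The function name find_dups matches the usage above, but the internals here are my own illustration, not his actual implementation.

```python
import hashlib
import os
from collections import defaultdict

def find_dups(paths):
    """Yield lists of files with identical contents found under the given paths."""
    by_size = defaultdict(list)
    for root in paths:
        for dirpath, _dirnames, filenames in os.walk(root):
            for name in filenames:
                p = os.path.join(dirpath, name)
                if os.path.isfile(p) and not os.path.islink(p):
                    by_size[os.path.getsize(p)].append(p)
    for same_size in by_size.values():
        if len(same_size) < 2:
            continue  # a unique size can't have duplicates
        by_hash = defaultdict(list)
        for p in same_size:
            h = hashlib.sha1()
            with open(p, 'rb') as f:
                for chunk in iter(lambda: f.read(65536), b''):
                    h.update(chunk)
            by_hash[h.hexdigest()].append(p)
        for group in by_hash.values():
            if len(group) > 1:
                yield group
```

The size-first pass is what makes this fast compared to hashing everything: most files can be ruled out without ever reading their contents.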
Let me now give the big warning. Often when you are doing backups you have quite a few files that are identical but that you would not want to replace with hard or symbolic links. That's because the whole point is to have an active file that you can modify and an older version you can restore from. If you make them links you've effectively destroyed your backup. Worse, you've destroyed it in a fashion where it appears like you still have a backup!
So this can be very dangerous. Only use it if you understand what you are doing.
Where this script can be quite useful, though, is when you have a backup disk on which you use the hard-link copy trick with cpio. (I'll probably cover this soon.) I have an rsync backup program that does this with all my programming files (in addition to using version control). So I have dated directories for each day with all my C, C++ and Python files, yet all the identical files are actually just a single file. (More or less what Time Machine does, but here I get to control things a bit more.)
The problem pops up when you copy this backup disk to a new disk. The copy program doesn't realize these multiple files are just a single one. So my backup, which only takes maybe 50 MB, suddenly balloons up to a rather large size (tens of gigabytes).
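You can see why the copy balloons by looking at inodes: two hard links to the same file share one device/inode pair, while a naive copy gives each path its own. A little helper (my own, not part of the script) makes the check easy:

```python
import os

def same_inode(a, b):
    """True if paths a and b are hard links to the same underlying file."""
    sa, sb = os.stat(a), os.stat(b)
    return (sa.st_dev, sa.st_ino) == (sb.st_dev, sb.st_ino)
```

Incidentally, rsync can preserve hard links when copying if you pass it the -H flag, which avoids the problem in the first place, though it can be slow on large trees.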
So this script lets me run it on my backups and more or less restore the hard links. To use it this way, just change the system call to
os.system("ln -f '" + main_file + "' '" + del_f +"'")
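If you'd rather not shell out at all (quoting filenames for os.system is fragile; a filename containing a single quote would break the command), the same replace-with-link step can be done directly with os.remove plus os.link or os.symlink. This is a sketch of an alternative make_ln, with a hard flag of my own invention to switch between the two modes:

```python
import os

def make_ln(file_list, hard=False):
    """Replace every duplicate in file_list with a link to the first file."""
    main_file = file_list[0]          # the "master" copy we keep
    for del_f in file_list[1:]:
        os.remove(del_f)              # what ln -f does: remove the existing copy
        if hard:
            os.link(main_file, del_f)     # hard link (same volume only)
        else:
            os.symlink(main_file, del_f)  # symbolic link
```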
To use the above, just make it executable and run it at the command line with each directory you wish it to scan:
./dups2ln.py dir1 dir2 dir3