DEV Community

Dimitri Merejkowsky
Dimitri Merejkowsky

Posted on • Originally published at dmerej.info on

Comparing directories: a case study

Originally published on my blog.

The problem

Let’s say we have a directory called O that contains a large amount of subdirectories and files.

We find ourselves with two different directories, A and B. Each of them started as a copy of O, but then, in each of them, some files were changed, some were added or removed, and we don’t know which ones.

We want to compare the contents of A and B, for instance to merge them back into O.

There are several ways to solve the problem, depending on the situation. Let’s take a look.

Using a GUI

For our first case, let’s assume A and B are on our development computer and we have a graphical session opened.

We can use many GUI tools for this task, here’s an example with Kdiff3:

kdiff3 screenshot

On the top left pane we have an overview of all the differences between the A and B directories, and we can see that:

  • The bar/eggs and bar/spam directories are identical between A and B
  • There’s a new file named bar/new.txt in A
  • There’s a new file named baz/six.txt in B
  • The file foo/a.txt was changed both in A and B

On the bottom panes, since a.txt is selected, we can see the differences in the contents.

Using the command line

Now let’s assume A and B are both on a remote server running Linux1 and we can’t open a remote graphical session there.

What are our options?

Well, we could use a coreutils command.

Coreutils is a package which is both very old and present in almost any Linux operating system. It contains a collection of useful, small command line programs. Even when it is not installed, an other package (like busybox) is usually here to provide similar functionality.

Anyway, there is a coreutils command named diff we can use for the task at hand:

$ diff --recursive a b
Only in a/bar: new.txt
Only in b/baz: six.txt
diff --recursive a/foo/a.txt b/foo/a.txt
1c1
< this is a
---
> this is modified a
Enter fullscreen mode Exit fullscreen mode

Boom! All the info we need.

diff offers lots of ways to tweak its behavior2. For instance, we can get more succinct info with the --brief option:

$ diff --recursive --brief a b
Only in a/bar: new.txt
Only in b/baz: six.txt
Files a/foo/a.txt and b/foo/a.txt differ
Enter fullscreen mode Exit fullscreen mode

By the way, if git is installed on the remote server, we can also use it, with nice options like --stat or --word-diff3

$ git diff --stat a b
 a/bar/new.txt => /dev/null | 1 -
 /dev/null => b/baz/six.txt | 1 +
 {a => b}/foo/a.txt | 2 +-
 3 files changed, 2 insertions(+), 2 deletions(-)

Enter fullscreen mode Exit fullscreen mode

Remote directories

Now let’s assume A and B are on two different remote servers: let’s call them abbot and costello respectively.

Also, let’s assume A and B contains lots of big files, so we cannot copy the A directory from abbot to costello and use the previous technique. We can, however, transfer small files between the two servers. Here’s we can do.

First, we log in to abbot:

ssh user@abbot

Enter fullscreen mode Exit fullscreen mode

Then, we go inside the a directory:

cd a

Enter fullscreen mode Exit fullscreen mode

Then we nun the following command:

find . | sort > ~/manifest-a

Enter fullscreen mode Exit fullscreen mode

Explanation:

  • find . lists all the files and directories inside the current working directory.
  • The order of the files returned by find are not deterministic and can change form one file system to an other, so we use a pipe (|) to take the output of find and pass it to the sort command. (sort is also part of the coreutils package).
  • Finally, we use an angle bracket (>) to write the output of sort to a file in the home directory: (~/manifest-a): we must not write the manifest-a file inside the a directory, otherwise the manifest may contain itself!

Now, we do the same on costello:

ssh user@costello
cd b
find . -type f | sort > ~/manifest-b
Enter fullscreen mode Exit fullscreen mode

Then we use scp to transfer the A manifest to costello:

# On abbot
scp manifest-a user@costello:

Enter fullscreen mode Exit fullscreen mode

Now we have two text files on costello, both containing a list of all the files. So, we use diff on the manifest files themselves:

diff ~/manifest-*
...
-./bar/new.txt
...
...
+./bar/six.txt

Enter fullscreen mode Exit fullscreen mode

That takes care of the contents of the directories. What about contents of the files themselves ?

Well, we can use an other coreutils tools called shasum.

shasum can be used to compute a checksum of the contents of a given file. It will generate the same output if the contents are the same, and if two files are different, there’s almost zero chance their checksum will be equal. 4

We can call shasum with a list of files as arguments, like this:

shasum file1.txt file2.txt
Enter fullscreen mode Exit fullscreen mode

Here we need to call shasum with the whole list of files, so we call find . again but with the -type f option to select only the files and ignore the directories. This gives us a list of lines, so we use xargs (also in coreutils) to convert them into a list of command line arguments5, and finally write the result in a a.shasum file:

find . -type f | xargs shasum > ~/a.shasum
Enter fullscreen mode Exit fullscreen mode

Here are the contents of the a.shasum file:

791c4ba196e0faea35e0c5fbe46e64da bar/eggs/two.txt
23f4a2592b2e4dee0444983a6e53c23e bar/new.txt
...
Enter fullscreen mode Exit fullscreen mode

Now we can transfer the a.shasum file to costello:

$ scp a.shasum user@costello:
Enter fullscreen mode Exit fullscreen mode

Finally, we can use shasum again with the --check option: it will read each line, parsing the file name and expected checksum, then compute the actual checksum for the given file and check if it matches the expected value:

$ cd b
$ shasum --check ~/a.shasum
./bar/eggs/two.txt: OK
./bar/spam/one.txt: OK
...
./foo/a.txt: FAILED
...
shasum: WARNING: 1 of 5 computed checksums did NOT match
Enter fullscreen mode Exit fullscreen mode

And we’re done: from the differences of the manifests we know which file are missing or were added, and from the list of checksums we know which files differ.

Conclusion

We saw that we can use nothing but command line tools from the coreutils package to compare the contents of two directories: the technique we used was working exactly the same way several decades ago, and will probably continue to work for a long time.

So, if you liked the techniques show here, here’s my advice to you: next time you are confronted with a task similar to the one we just studied, take a look at all of the coreutils documentation and try using them. If you do this often, after a while you’ll have gained a nice addition to your skill set.

Oh, and if you think “this is all useless”, please read this enlightening paper called Command-line Tools can be 235x Faster than your Hadoop Cluster :)

Cheers!


I'd love to hear what you have to say, so please feel free to leave a comment below, or check out my contact page for more ways to get in touch with me.


  1. I hear there are people running operating systems other than Linux on remote servers. I’m sorry if this is the case for you. 

  2. If you’d like to see all of what diff can do, feel free to run man diff and be amazed! 

  3. Git is smart enough to see that neither A and B are inside a directory it controls, and if A and B were inside a git working tree, you could use the --no-index option, as explained in the documentation[return] 

  4. You may already have heard of md5 sum: it works exactly the same way but uses an older algorithm to compute the checksum and its usage is generally discouraged. 

  5. If you use zsh you can als ask it to list all the files directly with a glob extension, like this: shasum **/*(.) > ~/a.md5. Yeah zsh! 

Oldest comments (0)