November 11th, 2005

  Thoughts on remote diff…

Greetings,
I sat in my office today, listening to a phone interview in which the candidate was asked a number of simple problems, and then the harder design question:

Given two computers linked over a slow link (say dialup), each has a 1TB file, how would you determine (1) if the two are different, and (2) what that difference is?

Me being the gadfly that I am, I spoke with my office-mate a little about what the ‘right’ answer should have been. He acknowledged that he was looking for a subdivision answer, one which determines the location of the file’s first difference. It’s a pretty simple answer conceptually: subdivide the file into (say) 4096 separate 256MB blocks. Do the MD5’s on them, pass them over the wire, so each side knows the MD5’s at each 256MB step. If there’s a mismatch, you instantly know the change was in that 256MB block, and you can subdivide that by 4096 again, so you’re MD5’ing 64K blocks. This finds you the 64K block where the first change happened, and you can either subdivide one more time, or just transmit the 64K containing the difference and figure it out right there.
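
To make that concrete, here’s a rough Python sketch of the subdivision idea (my own sketch, not anything the interviewer showed; the remote_md5s_for callback is just a stand-in for “ask the far side to hash the same region and send its digests back”):

    import hashlib

    def block_md5s(path, start, length, block_size):
        # MD5 each fixed-size block of the given region; returns a list of digests.
        digests = []
        with open(path, 'rb') as f:
            f.seek(start)
            remaining = length
            while remaining > 0:
                chunk = f.read(min(block_size, remaining))
                if not chunk:
                    break
                digests.append(hashlib.md5(chunk).hexdigest())
                remaining -= len(chunk)
        return digests

    def first_difference(path, remote_md5s_for, start, length, block_size, min_block=64 * 1024):
        # Narrow down the first differing block by repeated subdivision.
        # remote_md5s_for(start, length, block_size) stands in for asking the
        # other side to compute and send back its block MD5s for the same region.
        local = block_md5s(path, start, length, block_size)
        remote = remote_md5s_for(start, length, block_size)
        for i, (ours, theirs) in enumerate(zip(local, remote)):
            if ours != theirs:
                block_start = start + i * block_size
                if block_size <= min_block:
                    return block_start  # small enough to just transmit and compare
                # Recurse into the differing block with a 4096x smaller block size.
                return first_difference(path, remote_md5s_for, block_start,
                                        min(block_size, length - i * block_size),
                                        max(block_size // 4096, min_block), min_block)
        return None  # no difference found in this region

    # e.g. first_difference('huge.bin', ask_remote, 0, 2**40, 256 * 1024 * 1024)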

I pointed out that this only finds the first byte where there is a difference, and that if you wanted to know if some portion of the *rest* of the file is the same (or worse, you want to construct a minimum edit difference), you start to need to do more complex things. I went off, and started thinking about algorithms to make the rest of the problem tractable.

This is what I came up with, recorded here for posterity, and in case anybody with a decent sense of algorithmics would like to correct me.

It’s CPU-intensive, firstly. It’s CPU-intensive because CPU is what you’ve got: you can’t use bandwidth, because it’s defined as too low for this, so you have to spend what you do have, and that’s the CPU power on each side.

So one machine designates itself as the ‘compare to’ server. It then chunks the *remainder* of the 1TB file (from the point of difference on) into fixed-size chunks. (Say 64K chunks, which in the worst case would be 16 million of them.) Each 64K chunk is separately MD5’ed, and the result is stored on disk in a linked list hashed by its bottom two bytes, along with the block number it’s for. The bottom two bytes are also used as an index into a 64Kbit bit vector, and the bit at that location is set. Once the remainder of the file is completely processed, the bit vector is transmitted.
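
Roughly, the server-side pass would look something like this in Python (again just my sketch; a dict of lists plays the role of the hashed linked list, and the bit vector is a plain 8KB bytearray):

    import hashlib

    CHUNK = 64 * 1024

    def build_server_index(path, diff_point):
        # Chunk the remainder of the file into 64K blocks, MD5 each one, and
        # index the digests by their bottom two bytes.  Returns the index plus
        # a 64Kbit bit vector (8KB) that gets shipped to the client.
        index = {}                    # low 16 bits of MD5 -> list of (block_no, digest)
        bit_vector = bytearray(8192)  # 65536 bits, one per possible two-byte value
        with open(path, 'rb') as f:
            f.seek(diff_point)
            block_no = 0
            while True:
                data = f.read(CHUNK)
                if not data:
                    break
                digest = hashlib.md5(data).digest()
                low16 = digest[-1] | (digest[-2] << 8)   # the "bottom two bytes"
                index.setdefault(low16, []).append((block_no, digest))
                bit_vector[low16 >> 3] |= 1 << (low16 & 7)
                block_no += 1
        return index, bit_vector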

The ‘compare from’ client then creates (this is the computationally sucky part) 64K-1 simultaneous MD5’s, chunking from (diff+1) to (diff+65535). When it’s done chunking 64K of data (all 64K MD5s will complete at the same time), they’re checked, reset, and they all start again (at diff+65536 to…). It’s structured as 64K-1 simultaneous MD5s in the hope that some internal MD5 optimization could be found that keeps it from costing as much as running a single MD5 on the order of 64K*(EOF-DiffPoint) times.
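
Written out naively in Python (recomputing a full MD5 for the 64K window at every byte offset, rather than the staggered trick I’m hoping exists), the client-side scan is something like the following, which makes the CPU cost painfully obvious:

    import hashlib

    CHUNK = 64 * 1024

    def bit_set(bit_vector, low16):
        return bool(bit_vector[low16 >> 3] & (1 << (low16 & 7)))

    def scan_for_candidates(path, diff_point, bit_vector):
        # Slide a 64K window one byte at a time over the remainder of the file,
        # yielding (offset, digest) whenever the window's bottom two MD5 bytes
        # hit the server's bit vector.  Brute force: one full MD5 per offset.
        with open(path, 'rb') as f:
            f.seek(diff_point)
            window = f.read(CHUNK)
            offset = diff_point
            while len(window) == CHUNK:
                digest = hashlib.md5(window).digest()
                low16 = digest[-1] | (digest[-2] << 8)
                if bit_set(bit_vector, low16):
                    yield offset, digest
                nxt = f.read(1)
                if not nxt:
                    break
                window = window[1:] + nxt
                offset += 1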

During the ‘check’, the bottom two bytes of each MD5 are used as a lookup into the bit vector. If any MD5s hit a set bit, all of the matching ones are transmitted, and the server checks whether any of them is a recognized block. If one is, it returns that block’s number along with the next block’s 64K MD5, which the client can quickly verify by running a single MD5 over its own next 64K of data. If they match, the client and server can treat the rest of the file as if it were being checked from the first step again, because there may be more than one point of difference. More precision can be gained by transmitting the difference data from the client. Since we know there is a difference at the first byte (from the first algorithm), we do a backwards compare from the end of the last differing block on the server against the client data just before the block where we found the MD5 match. That finds us the last byte of the difference as well as the first. (Because of the size of the data involved, you could also transmit that region to the server and run a more conventional minimum edit difference algorithm against it, to get more precision.)
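
The server’s side of that exchange is the easy part; something like the following, where candidates is whatever the client transmitted and all_block_md5s is the in-order list of per-block digests from the indexing pass (again, just my sketch):

    def check_candidates(index, all_block_md5s, candidates):
        # candidates: (client_offset, digest) pairs whose bottom two bytes hit
        # the bit vector.  Look each one up in the real index; on a hit, return
        # the matching block number plus the *next* block's MD5 so the client
        # can confirm cheaply with a single MD5 of its own next 64K of data.
        for client_offset, digest in candidates:
            low16 = digest[-1] | (digest[-2] << 8)
            for block_no, known in index.get(low16, []):
                if known == digest:
                    next_md5 = (all_block_md5s[block_no + 1]
                                if block_no + 1 < len(all_block_md5s) else None)
                    return client_offset, block_no, next_md5
        return None  # every candidate was just a bottom-two-bytes collision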

As I said before, treat the rest of the file (from server block ‘N’ onward, and from the client offset DIFF+n where the match to block ‘N’ was found) as a new file to compare. Scale down the block sizes in this new instance, for quicker/greater precision, and do it again.

This should result in a decent minimum edit difference ‘collection’ on the server (or whichever side aggregates this data), without transmitting the entire 1TB of data over the wire. The numbers can be tuned, too; e.g. the two-byte bit vector can become three bytes, which still only takes 2MB up front to transmit, but substantially cuts down the likelihood of sub-hash matches, and thus the number of packets sent later.
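
Back-of-the-envelope, using the worst-case 16 million blocks from earlier (a quick check of my own, nothing rigorous):

    # How big is the bit vector, and how crowded does it get with 16M blocks?
    blocks = 16 * 1024 * 1024
    for key_bytes in (2, 3):
        slots = 1 << (8 * key_bytes)
        print("%d-byte key: %d KB bit vector, ~%.0f blocks per slot"
              % (key_bytes, slots // 8 // 1024, blocks / slots))
    # 2-byte key: 8 KB bit vector, ~256 blocks per slot
    # 3-byte key: 2048 KB bit vector, ~1 blocks per slot

With two bytes, basically every slot ends up occupied many times over, which is why the three-byte version cuts down the later chatter so much.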

Now, a real problem comes up if 1 byte in every 64K is modified on the client: no 64K window will ever match another block. The CPU load becomes pretty excessive at that point, since you’re effectively doing 64K MD5 passes over a 1TB(-64K) file for nothing.

At every block size this problem remains, unfortunately, and with a file as large as 1TB you really want to start with large block sizes. If, for instance, every 64th byte is altered, you’re pretty much doomed, since that’s the block size MD5 itself works in and you’ll never get a full match; at that point, in order to get a full edit difference between the two files, you’ll need to transmit one of them to the other.

So this approach is basically restricted to cases where you have some idea how many changes have been made (or expect that the answer should be zero).

The original interview question, however, is even more restricted, in that it only finds the first location of a change and doesn’t identify what kind of change it is, which in my book is a not terribly useful answer.

— Morgan Schweers, CyberFOX!

