MD5 vs SHA1

Saturday, July 05, 2008

Plastic relies on hashes for certain operations related to file and directory contents.

Hashes can take some time to calculate, so choosing the right algorithm can be important.

We're currently using MD5. Is it faster or slower than SHA1?

Here's the code I used to recursively calculate the hashes of the files of an entire directory:


using System;
using System.IO;
using System.Security.Cryptography;

class Hasher
{
    static void Main(string[] args)
    {
        string path = args.Length > 0 ? args[0] : Environment.CurrentDirectory;
        string hashtype = args.Length > 1 ? args[1] : "md5";

        Hasher calculator = new Hasher();

        int ini = Environment.TickCount;
        calculator.Calculate(path, hashtype);

        Console.WriteLine(
            "Time hashing {0} files {1} ms. Method {2}",
            calculator.mCount,
            Environment.TickCount - ini,
            hashtype);
    }

    int mCount = 0;

    void Calculate(string root, string hashtype)
    {
        foreach (string file in Directory.GetFiles(root, "*"))
        {
            Console.Write("Hashing {0}: ", file);
            using (FileStream st = new FileStream(
                file,
                FileMode.Open,
                FileAccess.Read,
                FileShare.ReadWrite))
            {
                string hash = hashtype == "sha1"
                    ? CalcSHA1HashCode(st)
                    : CalcMD5HashCode(st);
                Console.WriteLine(hash);
            }
            ++mCount;
        }

        foreach (string directory in Directory.GetDirectories(root, "*"))
            Calculate(directory, hashtype);
    }

    string CalcMD5HashCode(FileStream file)
    {
        MD5CryptoServiceProvider md5Provider =
            new MD5CryptoServiceProvider();
        byte[] hash = md5Provider.ComputeHash(file);
        return Convert.ToBase64String(hash);
    }

    string CalcSHA1HashCode(FileStream file)
    {
        SHA1CryptoServiceProvider sha1Provider =
            new SHA1CryptoServiceProvider();
        byte[] hash = sha1Provider.ComputeHash(file);
        return Convert.ToBase64String(hash);
    }
}


And the results:
Time hashing 1379 files 3676 ms. Method sha1
Time hashing 1379 files 3515 ms. Method md5

So there is almost no difference between the two methods. Mono and .NET obtain the same results too.

What can really make a difference (obviously) is the filesystem cache state: the first time you run it on a directory it will be much slower (about 10 times)! And here Linux systems normally win the battle thanks to their cache management strategies.
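To separate cold-cache from warm-cache numbers, one way is to time the same file twice in a row: the first pass pays the disk cost, the second reads from the OS cache. A minimal sketch (hypothetical; it uses Stopwatch rather than Environment.TickCount for better resolution):

```csharp
using System;
using System.Diagnostics;
using System.IO;
using System.Security.Cryptography;

class WarmCacheBench
{
    // Hash one file and return the elapsed wall-clock milliseconds.
    public static long TimeHash(string file)
    {
        Stopwatch sw = Stopwatch.StartNew();
        using (FileStream st = File.OpenRead(file))
        using (MD5 md5 = MD5.Create())
            md5.ComputeHash(st);
        sw.Stop();
        return sw.ElapsedMilliseconds;
    }

    static void Main(string[] args)
    {
        string file = args.Length > 0 ? args[0] : "test.bin";
        // First pass pays the disk cost; second pass hits the OS cache.
        Console.WriteLine("cold: {0} ms", TimeHash(file));
        Console.WriteLine("warm: {0} ms", TimeHash(file));
    }
}
```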

8 comments:

  1. Would you be able to increase performance by splitting the hashing of the file away from the reading of the file?

    i.e.:
    Thread1 - Reads the files into memory
    Thread2 - Hashes the files which are in memory

    In my own app I was hashing large files in chunks of 16 kB, and by using two threads I improved performance by ~10%. In your case it may not help, as most files are quite small. You'd also have to take care of the case where you have large files, so you don't read a 100 MB file into memory.

    ReplyDelete
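The two-thread read/hash pipeline described in the comment above could be sketched roughly like this (a hypothetical outline, assuming .NET 4's BlockingCollection; the large-file chunking the comment warns about is left out for brevity):

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.IO;
using System.Security.Cryptography;
using System.Threading;

class PipelinedHasher
{
    static void Main(string[] args)
    {
        string root = args.Length > 0 ? args[0] : Environment.CurrentDirectory;

        // Bounded to 3 buffers so the reader cannot race ahead and
        // fill memory with file contents the hasher has not consumed yet.
        var queue = new BlockingCollection<KeyValuePair<string, byte[]>>(3);

        // Thread 1: reads files into memory and enqueues them.
        Thread reader = new Thread(delegate()
        {
            foreach (string file in Directory.GetFiles(root))
                queue.Add(new KeyValuePair<string, byte[]>(
                    file, File.ReadAllBytes(file)));
            queue.CompleteAdding();
        });
        reader.Start();

        // Thread 2 (main): hashes while the reader keeps reading.
        using (MD5 md5 = MD5.Create())
            foreach (KeyValuePair<string, byte[]> item in
                     queue.GetConsumingEnumerable())
                Console.WriteLine("{0}: {1}", item.Key,
                    Convert.ToBase64String(md5.ComputeHash(item.Value)));

        reader.Join();
    }
}
```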
  2. Hi Alan,

    Do you mean passing bytes instead of a stream and reading in parallel?

    My concern here is that your proposal will eat a large amount of memory, won't it? I mean, I normally have to hash a large number of files...

    I'll try to figure out how to apply it anyway. Thanks for the suggestion!

    ReplyDelete
  3. Indeed, you might want to look at your file stream performance, to see if that is dominating.

    Compare to:
    http://botan.randombit.net/bmarks.html
    http://www.cryptopp.com/benchmarks.html
    http://www.cryptopp.com/benchmarks-amd64.html

    ReplyDelete
  4. Hey,

    Well, you can be smart with how you read data into memory. For example, you can only read a whole file into memory if it is less than 256kB in size. Then, you can limit yourself to a max of three files in memory at any one stage. So, if your cpu is slow, you can buffer 3 into memory, wait for one to hash, then buffer the next.

    If a file is large, you can read it in 256kB chunks and keep hashing the chunks one by one. The API allows you to hash files in pieces like that.

    You should be able to rig up a concept app easily enough which will tell you the potential gains. It could just be a console app with two threads, one reads files into a queue[byte], the other dequeues and hashes.

    ReplyDelete
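The piecewise hashing API the comment above refers to is HashAlgorithm.TransformBlock/TransformFinalBlock. A minimal sketch of hashing a stream in 256 kB chunks might look like this:

```csharp
using System;
using System.IO;
using System.Security.Cryptography;

class ChunkedHasher
{
    // Feed the stream to MD5 in 256 kB pieces instead of handing the
    // whole stream to ComputeHash, so a large file never needs to be
    // fully resident in memory.
    public static string HashInChunks(Stream input)
    {
        using (MD5 md5 = MD5.Create())
        {
            byte[] buffer = new byte[256 * 1024];
            int read;
            while ((read = input.Read(buffer, 0, buffer.Length)) > 0)
                md5.TransformBlock(buffer, 0, read, null, 0);
            md5.TransformFinalBlock(buffer, 0, 0);
            return Convert.ToBase64String(md5.Hash);
        }
    }

    static void Main(string[] args)
    {
        string file = args.Length > 0 ? args[0] : "test.bin";
        using (FileStream st = File.OpenRead(file))
            Console.WriteLine(HashInChunks(st));
    }
}
```

Hashing in pieces produces exactly the same digest as hashing the whole stream in one ComputeHash call.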
  5. MD5 is significantly faster than SHA1; you don't know a thing about benchmarking. You are limited by your filesystem speed. Put them in a ramdisk and then run it.

    And secondly.. are you guys seriously writing an SCM in C#? Are you out of your god damn minds? The SCM debate is effectively over- Linus won, as he usually does on technical merit, and he had the good sense to do it in C. Now, if you guys need something to do/sell, go about making git easier to use, or adding candy (merge tools, history viewers, etc) which improve workflow.

    ReplyDelete
  6. nosatalian, really funny comments.

    1- If the test is limited by the FS, it is limited for both methods, so the results are still consistent. Again, reading can be really fast once the data is cached, as in any FS.

    2- I "appreciate" your business advice; I hope it's based on a strong background... :-). Yes, we primarily use C#, and we can still outperform most of the SCMs out there.

    ReplyDelete
  7. Thanks for letting that through, Pablo.. whenever I see that 'Comment moderation has been enabled' I assume most of my colorful comments will end up in the ether.

    My point about C# is not really one about efficiency- clever algorithms usually beat a faster language. It just seems to me that the type of people who really care about SCMs (and hence would be writing them) are probably people with a lot of technical depth and experience. Most of those people tend to eschew C# due to the fact that Linux support is a joke, and they would rather work in an open source environment/language.

    But there may be very good reasons for you to chose C#- particularly if most of your value add is on the Windows UI side, then C# is probably the best bet for a nice UI.

    ReplyDelete
  8. Hi again nosatalian,

    I have to admit I was a little shocked by the way you wrote your "colorful" comment but, hey!, you don't learn if people don't tell you you're wrong! :-)

    Yes, you're right about the language. In fact, when we started developing Plastic SCM (4 years ago already!!!! we're growing up) my choice was C/C++ because it sounds like the "way to do things".

    I didn't know a line of C# back then, and being a former C++ developer (ok, maybe a C++ developer wannabe, since you never master C++, do you? :-P), C# really looked like a toy to me.

    But Dave (CTO) knew C# (also a former C++ developer) and asked me to take a look at it. And then I had to admit that things that are normally difficult with C/C++ (or at least don't compile and run on the first try) simply work in C#: dynamically loading code, remote method calls, memory management and so on.

    Then we tried Mono (which was much younger than it is now), and we liked it, and we got code up and running on both Linux & Windows from day one (and MacOS joined only a few weeks after).

    To be honest, sometimes I still miss C/C++ (graphics on MacOS/Linux were not easy), but more because of the tools (compilers, profilers, editors everywhere) and third-party libraries (Qt) than the language/library itself.

    So far I've never found a place in our code where C/C++ would be faster (maybe I'm plain wrong), and in the end you can always do a good job with C#, even on Linux (very good speed). And if I find something simply not good enough in C# (zlib), we link against the native code (which is doable on Mac/Win/Linux).

    Maybe MD5 calculation is one of these things... :-P

    The only true downside is tool start-up, which is slower (for a single cm command) than if it were native C, but normally it is not a huge hit.

    We outperform SVN by a factor of 10, even more under heavy load, and the same is true for other SCMs, although, as you pointed out, SVN is not the one to watch anymore, but Git. We can do really fast updates, even compared to Git (over the network against a local copy), because using threading and such is really easy in C# (doable, of course, in C/C++).

    I'm always working on performance, so I'm really pushing to make Plastic faster... every day. GIT is a good benchmark, of course.

    ReplyDelete
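Linking against native code from C#, as mentioned above for zlib, goes through P/Invoke. A hypothetical sketch calling zlib's crc32 (the library name "zlib1" is the usual Windows DLL name and an assumption here; on Linux/macOS it would be libz, where zlib's uLong may be 64-bit, so the uint marshaling below assumes a 32-bit uLong):

```csharp
using System;
using System.Runtime.InteropServices;
using System.Text;

class NativeZlib
{
    // zlib's signature: uLong crc32(uLong crc, const Bytef* buf, uInt len).
    // Marshaling uLong as uint assumes a platform where it is 32 bits.
    [DllImport("zlib1", EntryPoint = "crc32")]
    static extern uint crc32(uint crc, byte[] buf, uint len);

    static void Main()
    {
        byte[] data = Encoding.ASCII.GetBytes("hello");
        // crc32(0, null, 0) returns the required initial value.
        uint checksum = crc32(crc32(0, null, 0), data, (uint)data.Length);
        Console.WriteLine("crc32 = {0:x8}", checksum);
    }
}
```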