Confirm string GetHashCode reflects string, not object
Erick Thompson 1 May 2002 22:19:21
I need to check if a particular string has been seen before, so I was going to maintain a list of hashcodes from the strings. However, I want to confirm that the GetHashCode method returns a hashcode based on the string contents, not on the string object. The reason I'm not completely sure is that I'm not totally confident I know how string interning will effect things. Can I use GetHashCode to identify a string?
Thanks, Erick
You can read messages from the DOTNET archive, unsubscribe from DOTNET, or subscribe to other DevelopMentor lists at http://discuss.develop.com.
Erick Thompson 26 April 2002 23:41:21 [ permanent link ]
Upon further reflection, I think that using GetHashCode isn't a good idea to check if I've seen a string before. I think I'll use a MD5 sig instead, so I don't have to worry about collisions.
Erick
----- Original Message ----- From: "Erick Thompson" <ethompson@NBR.ORG> To: <DOTNET@DISCUSS.DEVELOP.COM> Sent: Friday, April 26, 2002 12:44 PM Subject: [DOTNET] Confirm string GetHashCode reflects string, not object
I need to check if a particular string has been seen before, so I was
going> to maintain a list of hashcodes from the strings. However, I want to confirm> that the GetHashCode method returns a hashcode based on the string contents,> not on the string object. The reason I'm not completely sure is that I'm
totally confident I know how string interning will effect things. Can I
GetHashCode to identify a string?>
Thanks,> Erick>
You can read messages from the DOTNET archive, unsubscribe from DOTNET, or> subscribe to other DevelopMentor lists at http://discuss.develop.com.>
You can read messages from the DOTNET archive, unsubscribe from DOTNET, or subscribe to other DevelopMentor lists at http://discuss.develop.com.
To answer your original question though, the string's hash function *has* to be based on the string value, not the object identity. (So interning or lack thereof will not affect you.) This is because the String class overrides Equals. Any class that overrides Equals is *required* to override GetHashCode too and to make the two consistent. (I.e. if Equals says two objects are the same, they must have the same hash values.)
However, just because two strings have the same hash code doesn't mean they are the same - GetHashCode *is* allowed to return false positives. (It's just false negatives that aren't allowed.) In fact here is a guaranteed valid-for-every-occasion implementation of GetHashCode():
public override int GetHashCode() { return 1; }
This doesn't break any rules. Of course it won't yield great performance if you use this on objects which are acting as keys in a HashTable... And it'll get you rather a lot of false positives for comparison.
But *any* hash has some chance of collision, unless the hash actually has at least as many bits as the data being hashed. The larger the hash, the lower the probability of collision will be (given a suitably good hash algorithm of course...) but the probability never drops to zero. MD5 is believed to be a good hash algorithm, and it has a length of 128 bytes (i.e. 1024 bits) IIRC, which gives a pretty low chance of a collision.
But the question I would ask is are you actually wasting more space by storing MD5 hashes than you would be by just remembering the strings in the first place? Unless your strings are on average longer than 128 bytes (the size of an MD5 hash) then it will actually be rather more expensive to do what you're doing than just to store all the strings... Or is there some reason you can't store the strings?
-- Ian Griffiths DevelopMentor
----- Original Message ----- From: "Erick Thompson" <ethompson@NBR.ORG>
Upon further reflection, I think that using GetHashCode isn't a good idea
check if I've seen a string before. I think I'll use a MD5 sig instead, so
don't have to worry about collisions.>
Erick>
----- Original Message -----> From: "Erick Thompson" <ethompson@NBR.ORG>>
I need to check if a particular string has been seen before, so I> > was going to maintain a list of hashcodes from the strings.> > However, I want to confirm that the GetHashCode method> > returns a hashcode based on the string contents, not on the> > string object. The reason I'm not completely sure is that I'm> > not totally confident I know how string interning will effect things.> > Can I use GetHashCode to identify a string?
You can read messages from the DOTNET archive, unsubscribe from DOTNET, or subscribe to other DevelopMentor lists at http://discuss.develop.com.
Erick Thompson 29 April 2002 20:34:21 [ permanent link ]
But the question I would ask is are you actually wasting more space by> storing MD5 hashes than you would be by just remembering the strings in
first place? Unless your strings are on average longer than 128 bytes
(the> size of an MD5 hash) then it will actually be rather more expensive to do> what you're doing than just to store all the strings... Or is there some> reason you can't store the strings?
There is no reason not to store the strings, except that I was a little concerned about memory useage. The number of strings is going to fairly large, so I was hoping for some savings using a hash. Given that the 128 bytes is actually a good sized string, and the possibility for false positives, I think I may go back to storing the actual strings.
Thanks for the good overview of GetHashCode. I think I need to up the time I spend with Richter's book.
Erick
You can read messages from the DOTNET archive, unsubscribe from DOTNET, or subscribe to other DevelopMentor lists at http://discuss.develop.com.
An MD5 hash is actually only 16 bytes, not 128. Given that each character in a string is two bytes (unicode), if your strings are typically longer than 8 characters, you ought to get a memory usage saving from using the MD5 hash.
-----Original Message-----> From: dotnet discussion [mailtoOTNET@DISCUSS.DEVELOP.COM]> On Behalf Of Erick Thompson> Sent: Monday, April 29, 2002 12:34 PM> To: DOTNET@DISCUSS.DEVELOP.COM> Subject: Re: [DOTNET] Confirm string GetHashCode reflects> string, not object>
There is no reason not to store the strings, except that I> was a little concerned about memory useage. The number of> strings is going to fairly large, so I was hoping for some> savings using a hash. Given that the 128 bytes is actually a> good sized string, and the possibility for false positives, I> think I may go back to storing the actual strings.>
Thanks for the good overview of GetHashCode. I think I need> to up the time I spend with Richter's book.>
Erick>
You can read messages from the DOTNET archive, unsubscribe> from DOTNET, or subscribe to other DevelopMentor lists at> http://discuss.develop.com.>
You can read messages from the DOTNET archive, unsubscribe from DOTNET, or subscribe to other DevelopMentor lists at http://discuss.develop.com.
*goes back and actually reads the documentation for HashAlgorithm.HashSize*
"Gets the size of the computer hash code in *bits*." (my emphasis)
D'Oh!
-- Ian Griffiths DevelopMentor
----- Original Message ----- From: "Joel Mueller" <jmueller@SWIFTK.COM>
An MD5 hash is actually only 16 bytes, not 128. Given that each> character in a string is two bytes (unicode), if your strings are> typically longer than 8 characters, you ought to get a memory usage> saving from using the MD5 hash.>
There is no reason not to store the strings, except that I> > was a little concerned about memory useage. The number of> > strings is going to fairly large, so I was hoping for some> > savings using a hash. Given that the 128 bytes is actually a> > good sized string, and the possibility for false positives, I> > think I may go back to storing the actual strings.> >
Thanks for the good overview of GetHashCode. I think I need> > to up the time I spend with Richter's book.
You can read messages from the DOTNET archive, unsubscribe from DOTNET, or subscribe to other DevelopMentor lists at http://discuss.develop.com.