This is my first post to this newsgroup, and I'm quite new to C programming, so please be gentle
I'm writing a small program in C that needs to take one or more words of input from the user, and then look these words up in a "dictionary" and inform the user whether all the words are good or not. For legal reasons, the dictionary cannot be in plain text, so I've been thinking about taking all the words in the dictionary and running them through a hash function, and using the resulting dictionary of hashes for lookups. I found a function posted by Daniel J Bernstein (djb hash) which looks suitable for this purpose. However, I'm not trained that well in mathematics, so I don't know how to tell if this function will produce any collisions for my word list. My word list consists of approximately 200,000 words, and the maximum length of a word is 15 characters. Does anyone have experience with this function (or anything similar), and if so, can you tell me if I'm going down the right path, or is this a non-starter?
The djb hash function is:
unsigned long hash(unsigned char *str) { В В unsigned long hash = 5381; В В int c; В В while (c = *str++) В В В В hash = ((hash << 5) + hash) + c; /* hash * 33 + c */ В В return hash; }
On Sun, Mar 19, 2006 at 10:36:39PM +0100, Anand Buddhdev wrote:
This is my first post to this newsgroup, and I'm quite new to C> programming, so please be gentle
c-prog is a mailing list, not a newsgroup...
I'm writing a small program in C that needs to take one or more words of> input from the user, and then look these words up in a "dictionary" and> inform the user whether all the words are good or not. For legal> reasons, the dictionary cannot be in plain text, so I've been thinking> about taking all the words in the dictionary and running them through a> hash function, and using the resulting dictionary of hashes for> lookups. I found a function posted by Daniel J Bernstein (djb hash)> which looks suitable for this purpose. However, I'm not trained that> well in mathematics, so I don't know how to tell if this function will> produce any collisions for my word list. My word list consists of> approximately 200,000 words, and the maximum length of a word is 15> characters. Does anyone have experience with this function (or anything> similar), and if so, can you tell me if I'm going down the right path,> or is this a non-starter?
You should be able to write a simple program to store the entire hash table in memory and check for duplicate hashes.
Your hash() function returns an unsigned long, which is 4 bytes wide on most systems, so the hash table will be 200,000 * 4 = 800,000 bytes, which is small for any modern PC.
Anand Buddhdev 20 March 2006 16:59:08 [ permanent link ]
andrew clarke <mail@...> writes:
This is my first post to this newsgroup, and I'm quite new to C> > programming, so please be gentle >
c-prog is a mailing list, not a newsgroup...
I found that out after I had posted the message via Gmane's NNTP interface. Apologies for using the wrong words.
I'm writing a small program in C that needs to take one or more words of> > input from the user, and then look these words up in a "dictionary" and> > inform the user whether all the words are good or not. For legal> > reasons, the dictionary cannot be in plain text, so I've been thinking> > about taking all the words in the dictionary and running them through a> > hash function, and using the resulting dictionary of hashes for> > lookups. I found a function posted by Daniel J Bernstein (djb hash)> > which looks suitable for this purpose. However, I'm not trained that> > well in mathematics, so I don't know how to tell if this function will> > produce any collisions for my word list. My word list consists of> > approximately 200,000 words, and the maximum length of a word is 15> > characters. Does anyone have experience with this function (or anything> > similar), and if so, can you tell me if I'm going down the right path,> > or is this a non-starter?>
You should be able to write a simple program to store the entire hash> table in memory and check for duplicate hashes.
Ok, I can do that. However, I would also like to know if there might be invalid anagrams or letter combinations that will produce a collision, because I need to distinguish good words from bad ones. But perhaps that is a question for a cryptography list, and not C programming per se. For example, if I have "aa", "be", "ch" and "de" in my word list, I want to be sure that no other words, for example, "yz", produce the same hash as the good words. Checking this manually for the set 'aaaaaaaaaaaaaaa' to 'zzzzzzzzzzzzzzz' will take quite a long time I think, even on a fast computer.
However, if anyone does have any experience with such applications, their input would be appreciated.
so I don't know how to tell if this function will> produce any collisions for my word list.
It will produce collisions. you have to resolve them.
Does anyone have experience with this function (or anything> similar), and if so, can you tell me if I'm going down the right path,> or is this a non-starter?
use ElfHash / RSHash and compare, and use which suits your application These suited for my 15 Million Words. They produce collisions, but i resolved them So, you have too resolve.
If you wanna avoid collisions, then use these algorithms
- MD5 - crc-16 / 32
google MD5 + wiki or crc-16 + wiki
But, they produce 32 characters unique value, with no collisions, but you can't map the 32-char values to your hash table length - 200,000. If you do so, u will get hash values which will have collisions.
-- Ramprasad B
__________________________________________________ Do You Yahoo!? Tired of spam? Yahoo! Mail has the best spam protection around http://mail.yahoo.com
Ok, I can do that. However, I would also like to know if there might be invalid> anagrams or letter combinations that will produce a collision, because I need to> distinguish good words from bad ones.
Write an word parser for finding legal words.
For example, if I have "aa",> "be", "ch" and "de" in my word list, I want to be sure that no other words, for> example, "yz", produce the same hash as the good words.
Resolve the collision. you can't avoid collisions without taking care of resolving collisions.
Checking this manually> for the set 'aaaaaaaaaaaaaaa' to 'zzzzzzzzzzzzzzz' will take quite a long time I> think, even on a fast computer.
That's sick way of writing a program.
However, if anyone does have any experience with such applications, their input> would be appreciated.
Mine is above.
-- Ramprasad B
__________________________________________________ Do You Yahoo!? Tired of spam? Yahoo! Mail has the best spam protection around http://mail.yahoo.com
On 3/20/06, Ramprasad B <ramprasad_i82-/E1597aS9LQAvxtiuMwx3w@public.gmane.org> wrote:> --- Anand Buddhdev <arb-UOPLgynvtKHYtjvyW6yDsg@public.gmane.org> wrote:>
Ok, I can do that. However, I would also like to know if there might be invalid> > anagrams or letter combinations that will produce a collision, because I need to> > distinguish good words from bad ones.>
Write an word parser for finding legal words.>
For example, if I have "aa",> > "be", "ch" and "de" in my word list, I want to be sure that no other words, for> > example, "yz", produce the same hash as the good words.
The usual point of hashing is to reduce a large domain values (your words) to a smaller domain (the hash values). If the domain of the large set is larger than the domain of the smaller set then you *will* get collisions.
Mapping the alphabetic characters A-Z to a three-bit hash for example will produce collisions. The aim of a good hash function is to spread the collisions so one particular hash value will not selected more often than others.
If you want to map the large domain to another domain on a one to one relationship, you do not want hashing. Try a cipher instead if you don't want to use the words themselves (the simplest method)
Resolve the collision.> you can't avoid collisions without taking> care of resolving collisions.>
Checking this manually> > for the set 'aaaaaaaaaaaaaaa' to 'zzzzzzzzzzzzzzz' will take quite a long time I> > think, even on a fast computer.>
That's sick way of writing a program.
It's also pointless I've demonstrated mathematically above why.
However, if anyone does have any experience with such applications, their input> > would be appreciated.
If you want unique values, use a cipher or some other method that does not result in 'loss' of data. Otherwise use a hash and deal with the collisions.
Anand Buddhdev 20 March 2006 20:20:21 [ permanent link ]
Paul Herring <pauljherring@...> writes:
If you want to map the large domain to another domain on a one to one> relationship, you do not want hashing. Try a cipher instead if you> don't want to use the words themselves (the simplest method)
Hi Paul,
Thank you very much for this! It has cleared up the confusion I had between hashes and ciphers. I'll go away and do some reading on ciphers now.
If you would like to report an abuse of our service, such as a spam message, please . Если Вы хотите пожаловаться на содержимое этой страницы, пожалуйста .