What is OPML?
Simple and fast hash function for strings
Hello Guest
  
  • Login
• Register…
• Start blog
  • Who, Where, When
• What is interesting here?
• Duels
  • Polls
• Avatars
• Interests
  • Cities and Countries
• Random blog
• Users search
  • Search
• Games
• Tests
• QAIX
  • Сообщества
• Talxy Chat
• Horoscope
• Online
 
Register!

QAIX > C/C++ Programming > Simple and fast hash function for strings 15 November 2009 01:24:57

  Top users: 
  Recent blog posts: 
  They have birthday today: 
  Forums:   
  Discuss: 
  Recent forum topics: 
  Recent forum comments:
  Модератор:

Simple and fast hash function for strings

Anand Buddhdev 20 March 2006 01:36:39
 Hi everyone,

This is my first post to this newsgroup, and I'm quite new to C
programming, so please be gentle :)­

I'm writing a small program in C that needs to take one or more words of
input from the user, and then look these words up in a "dictionary" and
inform the user whether all the words are good or not. For legal
reasons, the dictionary cannot be in plain text, so I've been thinking
about taking all the words in the dictionary and running them through a
hash function, and using the resulting dictionary of hashes for
lookups. I found a function posted by Daniel J Bernstein (djb hash)
which looks suitable for this purpose. However, I'm not trained that
well in mathematics, so I don't know how to tell if this function will
produce any collisions for my word list. My word list consists of
approximately 200,000 words, and the maximum length of a word is 15
characters. Does anyone have experience with this function (or anything
similar), and if so, can you tell me if I'm going down the right path,
or is this a non-starter?

The djb hash function is:

unsigned long hash(unsigned char *str) {
В В unsigned long hash = 5381;
В В int c;
В В while (c = *str++)
В В В В hash = ((hash << 5) + hash) + c; /* hash * 33 + c */
В В return hash;
}

--
Anand






Add comment
Andrew Clarke 20 March 2006 16:41:50 permanent link ]
 On Sun, Mar 19, 2006 at 10:36:39PM +0100, Anand Buddhdev wrote:
This is my first post to this newsgroup, and I'm quite new to C> programming, so please be gentle :)­

c-prog is a mailing list, not a newsgroup...
I'm writing a small program in C that needs to take one or more words of> input from the user, and then look these words up in a "dictionary" and> inform the user whether all the words are good or not. For legal> reasons, the dictionary cannot be in plain text, so I've been thinking> about taking all the words in the dictionary and running them through a> hash function, and using the resulting dictionary of hashes for> lookups. I found a function posted by Daniel J Bernstein (djb hash)> which looks suitable for this purpose. However, I'm not trained that> well in mathematics, so I don't know how to tell if this function will> produce any collisions for my word list. My word list consists of> approximately 200,000 words, and the maximum length of a word is 15> characters. Does anyone have experience with this function (or anything> similar), and if so, can you tell me if I'm going down the right path,> or is this a non-starter?

You should be able to write a simple program to store the entire hash
table in memory and check for duplicate hashes.

Your hash() function returns an unsigned long, which is 4 bytes wide on
most systems, so the hash table will be 200,000 * 4 = 800,000 bytes,
which is small for any modern PC.

Regards
Andrew


Add comment
Anand Buddhdev 20 March 2006 16:59:08 permanent link ]
 andrew clarke <mail@...> writes:
This is my first post to this newsgroup, and I'm quite new to C> > programming, so please be gentle :)­>
c-prog is a mailing list, not a newsgroup...

I found that out after I had posted the message via Gmane's NNTP interface.
Apologies for using the wrong words.
I'm writing a small program in C that needs to take one or more words of> > input from the user, and then look these words up in a "dictionary" and> > inform the user whether all the words are good or not. For legal> > reasons, the dictionary cannot be in plain text, so I've been thinking> > about taking all the words in the dictionary and running them through a> > hash function, and using the resulting dictionary of hashes for> > lookups. I found a function posted by Daniel J Bernstein (djb hash)> > which looks suitable for this purpose. However, I'm not trained that> > well in mathematics, so I don't know how to tell if this function will> > produce any collisions for my word list. My word list consists of> > approximately 200,000 words, and the maximum length of a word is 15> > characters. Does anyone have experience with this function (or anything> > similar), and if so, can you tell me if I'm going down the right path,> > or is this a non-starter?>
You should be able to write a simple program to store the entire hash> table in memory and check for duplicate hashes.

Ok, I can do that. However, I would also like to know if there might be invalid
anagrams or letter combinations that will produce a collision, because I need to
distinguish good words from bad ones. But perhaps that is a question for a
cryptography list, and not C programming per se. For example, if I have "aa",
"be", "ch" and "de" in my word list, I want to be sure that no other words, for
example, "yz", produce the same hash as the good words. Checking this manually
for the set 'aaaaaaaaaaaaaaa' to 'zzzzzzzzzzzzzzz' will take quite a long time I
think, even on a fast computer.

However, if anyone does have any experience with such applications, their input
would be appreciated.

Greetings,

--
Anand



Add comment
Ramprasad B 20 March 2006 16:59:47 permanent link ]
 --- Anand Buddhdev <arb-UOPLgynvtKHYtj­vyW6yDsg@public.gman­e.org> wrote:
so I don't know how to tell if this function will> produce any collisions for my word list.

It will produce collisions.
you have to resolve them.
Does anyone have experience with this function (or anything> similar), and if so, can you tell me if I'm going down the right path,> or is this a non-starter?

use ElfHash / RSHash and compare, and use which suits your application
These suited for my 15 Million Words.
They produce collisions, but i resolved them
So, you have too resolve.

If you wanna avoid collisions, then use these algorithms

- MD5
- crc-16 / 32

google MD5 + wiki or crc-16 + wiki

But, they produce 32 characters unique value, with no collisions, but you can't map
the 32-char values to your hash table length - 200,000.
If you do so, u will get hash values which will have collisions.

--
Ramprasad B


___________________­____________________­___________
Do You Yahoo!?
Tired of spam? Yahoo! Mail has the best spam protection around
http://mail.yahoo.c­om


Add comment
Ramprasad B 20 March 2006 18:15:14 permanent link ]
 --- Anand Buddhdev <arb-UOPLgynvtKHYtj­vyW6yDsg@public.gman­e.org> wrote:
Ok, I can do that. However, I would also like to know if there might be invalid> anagrams or letter combinations that will produce a collision, because I need to> distinguish good words from bad ones.

Write an word parser for finding legal words.
For example, if I have "aa",> "be", "ch" and "de" in my word list, I want to be sure that no other words, for> example, "yz", produce the same hash as the good words.

Resolve the collision.
you can't avoid collisions without taking
care of resolving collisions.
Checking this manually> for the set 'aaaaaaaaaaaaaaa' to 'zzzzzzzzzzzzzzz' will take quite a long time I> think, even on a fast computer.

That's sick way of writing a program.
However, if anyone does have any experience with such applications, their input> would be appreciated.

Mine is above.

--
Ramprasad B

___________________­____________________­___________
Do You Yahoo!?
Tired of spam? Yahoo! Mail has the best spam protection around
http://mail.yahoo.c­om


Add comment
Paul Herring 20 March 2006 19:01:42 permanent link ]
 On 3/20/06, Ramprasad B <ramprasad_i82-/E15­97aS9LQAvxtiuMwx3w@p­ublic.gmane.org> wrote:> --- Anand Buddhdev <arb-UOPLgynvtKHYtj­vyW6yDsg@public.gman­e.org> wrote:>
Ok, I can do that. However, I would also like to know if there might be invalid> > anagrams or letter combinations that will produce a collision, because I need to> > distinguish good words from bad ones.>
Write an word parser for finding legal words.>
For example, if I have "aa",> > "be", "ch" and "de" in my word list, I want to be sure that no other words, for> > example, "yz", produce the same hash as the good words.

The usual point of hashing is to reduce a large domain values (your
words) to a smaller domain (the hash values). If the domain of the
large set is larger than the domain of the smaller set then you *will*
get collisions.

Mapping the alphabetic characters A-Z to a three-bit hash for example
will produce collisions. The aim of a good hash function is to spread
the collisions so one particular hash value will not selected more
often than others.

If you want to map the large domain to another domain on a one to one
relationship, you do not want hashing. Try a cipher instead if you
don't want to use the words themselves (the simplest method)
Resolve the collision.> you can't avoid collisions without taking> care of resolving collisions.>
Checking this manually> > for the set 'aaaaaaaaaaaaaaa' to 'zzzzzzzzzzzzzzz' will take quite a long time I> > think, even on a fast computer.>
That's sick way of writing a program.

It's also pointless :)­ I've demonstrated mathematically above why.
However, if anyone does have any experience with such applications, their input> > would be appreciated.

If you want unique values, use a cipher or some other method that does
not result in 'loss' of data. Otherwise use a hash and deal with the
collisions.

--
PJH

#define EPSILON LLONG_MIN


Add comment
Anand Buddhdev 20 March 2006 20:20:21 permanent link ]
 Paul Herring <pauljherring@...> writes:
If you want to map the large domain to another domain on a one to one> relationship, you do not want hashing. Try a cipher instead if you> don't want to use the words themselves (the simplest method)

Hi Paul,

Thank you very much for this! It has cleared up the confusion I had between
hashes and ciphers. I'll go away and do some reading on ciphers now.

--
Anand



Add comment
bahaa 15 November 2009 01:24:57 permanent link ]
 plz somebody help me........

my prof. ask me to do Like AI chat program ,,, when i ask the computer answer it using C#

he just give me JUST one hint Hash Function and i still looking for it,,,,,


PLZ I need help i study computer animation and the course IS AI for gaming ,,,,


No related But he hates me .....\



Anyone Can help ...

bahaa58@hotmail.com
Add comment
 

Add new comment

As:
Login:  Password:  
 
 
  
 
Пожалуйста, относитесь к собеседникам уважительно, не используйте нецензурные слова, не злоупотребляйте заглавными буквами, не публикуйте рекламу и объявления о купле/продаже, а также материалы нарушающие сетевой этикет или законы РФ. Ваш ip-адрес записывается.


QAIX > C/C++ Programming > Simple and fast hash function for strings 15 November 2009 01:24:57

see also:
Can a parent docoment call a function…
A object is passed to a function using…
k
pass tests:
KaKaya ti HEKA? -)
see also:
How to rip DVD to…
Video Format Total Solution: How to…
How to rip DVD to…

  Copyright © 2001—2010 QAIX
Идея: Монашёв Михаил.
Авторами текстов, изображений и видео, размещённых на этой странице, являются пользователи сайта.
See Help and FAQ in the community support.qaix.com.
Write in the community about the bugs you have noticedbugs.qaix.com.
Write your offers and comments in the communities suggest.qaix.com.
Information for parents.
Пишите нам на .
If you would like to report an abuse of our service, such as a spam message, please .
Если Вы хотите пожаловаться на содержимое этой страницы, пожалуйста .