Implement a hashtable that uses either chaining or open addressing to resolve collisions. This hashtable should simply store words as both key and data (essentially there is no data, just keys). Duplicate words should not be added. Your task is to experiment with the performance of various hashing functions on the following operation: counting the number of unique words in an input file.

The way this operation is done is as follows. Create an initially empty hashtable H. Open the input file. For every word w in the input file, if w is already in H (found via a look-up in the hashtable), do nothing. If it is not, add w to the hashtable (via an insertion). After processing the file, simply return the number of words in the hashtable.
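The loop above can be sketched as follows. This is a minimal sketch, not a reference solution: it assumes a chaining hashtable with a placeholder hash function (the point of the assignment is to swap different hash functions into `_hash`), and the names `ChainingHashTable` and `count_unique_words` are illustrative, not required.

```python
import re

class ChainingHashTable:
    """Minimal chaining hashtable that stores words as keys (sketch only)."""
    def __init__(self, capacity=128):
        self.buckets = [[] for _ in range(capacity)]
        self.count = 0

    def _hash(self, key):
        # Placeholder hash function; the assignment experiments with
        # swapping different hash functions in here.
        return sum(ord(c) for c in key) % len(self.buckets)

    def contains(self, key):
        return key in self.buckets[self._hash(key)]

    def insert(self, key):
        if not self.contains(key):
            self.buckets[self._hash(key)].append(key)
            self.count += 1

def count_unique_words(path):
    """Return the number of unique words in the file at `path`."""
    table = ChainingHashTable()
    with open(path) as f:
        for line in f:
            for word in re.findall(r"[A-Za-z]+", line):
                w = word.upper()
                if not table.contains(w):   # look-up first
                    table.insert(w)         # insert only if absent
    return table.count
```

Note that a fixed capacity is used here only to keep the sketch short; your implementation should resize to hold the load factor constant.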

A word in a file is defined as a maximal sequence of contiguous alphabetic characters (case-insensitive: convert to upper-case). So the following line: "I emailed sk8er Joe about the meeting at 7:30 -notAT 8:30." has 10 unique words: I, EMAILED, SK, ER, JOE, ABOUT, THE, MEETING, AT, NOTAT

Note that run, runs, running are all different words. That is, you don’t need to compute the stem of a word.
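The word definition above can be expressed directly as a regular expression for alphabetic runs. This is one possible tokenizer, shown here only to make the definition concrete (the helper name `words_of` is illustrative):

```python
import re

def words_of(line):
    """Extract words: maximal runs of alphabetic characters, upper-cased."""
    return [w.upper() for w in re.findall(r"[A-Za-z]+", line)]
```

Running it on the example line above splits "sk8er" at the digit into SK and ER, and keeps "notAT" together as the single word NOTAT.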


Testing your hashing performance is fairly easy: run the unique-words operation on files of different sizes (at least 20 of them) and time how long each hashing scheme takes.

To keep everything else constant, be sure the only thing that differs is the hashing function. In particular, keep the load factor fixed, at say 0.5 or 0.75, and keep the method of collision resolution the same. When using a probabilistic hashing function (that is, one where a random value a is chosen at the start), I recommend testing the performance on the same file multiple times and taking the average, so that one bad random value does not ruin the overall picture of the performance.
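A possible shape for this measurement, assuming a polynomial hash whose base a is drawn at random (the factory and helper names are illustrative, and the prime modulus is one common choice, not a requirement):

```python
import random
import time

def polyhash_factory(m):
    """Return a polynomial hash over table size m with a random base a."""
    p = 1_000_000_007                 # a large prime modulus (one common choice)
    a = random.randrange(1, p)        # the random value chosen at the start
    def h(s):
        v = 0
        for c in s:                   # Horner's rule: v = v*a + code(c) mod p
            v = (v * a + ord(c)) % p
        return v % m
    return h

def average_time(run, trials=5):
    """Time run() several times and return the mean, so one bad
    random value does not distort a single measurement."""
    total = 0.0
    for _ in range(trials):
        start = time.perf_counter()
        run()
        total += time.perf_counter() - start
    return total / trials
```

Each trial would re-draw a fresh random a (i.e., call the factory again inside `run`) so the average reflects typical, not one-off, behavior.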


1. Bad: The sum of the ASCII values of each character in the string.

2. Bad: The product of the ASCII values of each character in the string.

3. Good: Poly Hash of the ASCII values of each character in the string.

4. Two other functions of your own choosing (and research).
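The first three functions might be sketched as below. This is only one way to write them (here in Python, where the product cannot overflow; in a fixed-width language the product would also overflow, which is part of why it is bad):

```python
def additive_hash(s, m):
    """Bad: sum of the character codes. Anagrams collide, and the
    sum stays small, so keys cluster in low buckets."""
    return sum(ord(c) for c in s) % m

def multiplicative_hash(s, m):
    """Bad: product of the character codes. Values share small factors,
    so the distribution over buckets is very uneven."""
    p = 1
    for c in s:
        p *= ord(c)
    return p % m

def poly_hash(s, m, a=31):
    """Good: polynomial hash via Horner's rule with base a
    (a=31 is a common default; any small base works for the sketch)."""
    v = 0
    for c in s:
        v = (v * a + ord(c)) % m
    return v
```

For the two functions of your own choosing, the sample output below suggests one pairing that has been tried: a division hash and a rolling hash, but any two researched functions are fine.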


Any text file.

Array size is 96

Time for Adding is 10

Time for Multiplying is 4

Time for Poly Hash is 3

Time for Division Hash is 3

Time for Rolling Hash is 3

***