Implement a hashtable that uses either chaining or open addressing to resolve collisions. This hashtable should simply store words as both key and data (essentially there is no data just keys). Duplicate words should not be added. Your task is to experiment with the performance of various hashing functions on the following operation: Counting the number of unique words in an input file.
The way this operation is done is as follows. Create an initially empty hashtable H. Open the input file. For every word w in the input file, if w is already in H (found via a look-up in the hashtable), do nothing. If it is not, add w to the hashtable (via an insertion). After processing the file, simply return the number of words in the hashtable.
A word in a file is defined as a sequence of continuous alphabetic characters (case insensitive - convert to upper-case). So the following line: "I emailed sk8er Joe about the meeting at 7:30 -notAT 8:30." has 10 unique words: I, EMAILED, SK, ER, JOE, ABOUT, THE, MEETING, AT, NOT
Note that run, runs, running are all different words. That is, you don’t need to compute the stem of a word.
Performance Testing
Testing your hashing performance is fairly easy. Just run the unique-words operation on different sized files (at least 20 of them) and time how long it takes for the different hashing code schemes.
To help aid in keeping everything else constant, be sure the only thing different is the hashing function used. In particular, keep the load factor fixed, at say 0.5 or 0.75, and the method of collision resolution constant. When using a probabilistic hashing function, that is, where a random value a is chosen at the start, I recommend testing the performance on the same file multiple times and taking the average so that one bad random value does not ruin the overall picture of the performance.
Hashing Functions
1. Bad: The sum of the ASCII values of each character in the string. 2. Bad: The product of the ASCII values of each character in the string. 3. Good: Poly Hash of the ASCII values of each character in the string. 4. Two other functions of your own choosing (and research).
Input
Any text file.
Output
EXAMPLE: Array size is 96 Time for Adding is 10 Time for Multiplying is 4 Time for Poly Hash is 3 Time for Division Hash is 3 Time for Rolling Hash is 3
***Time will vary every time it is ran, even if it is on the same file. It will roughly be around the same amount though. Time is measured in milliseconds.***