CS 302 - Data Structures & Algorithms - Spring 2005
Programming Assignment 2
Passage Retrieval with a Templated Hash Table
Due
Wednesday, February 16th at 11:59pm.
Turn in a printout of the assignment on Friday, February 18th, at the beginning of class. Be sure to include the Honor Code Statement:
"I hereby declare that I have abided by the Honor Code during this assignment."
Introduction
Information retrieval relies on hash tables to ensure that documents are retrieved
as quickly as possible. However, hashing words is only one part of the solution to fast
retrieval. Another is the information that is actually stored. Search engines build what are known
as inverted indexes. An inverted index maps each vocabulary word, which serves as the key
to the hash table, to the list of documents that contain that word. Generally, the number of times
a query word occurs in a document is used to help determine the position of that document
in the ranked list of documents. However, for this assignment, we will be performing
Boolean search. This means that a document is retrieved if and only if it contains all of
the query words, regardless of how often they occur. In addition, we will
perform retrieval on passages of 100 words taken from the original documents. This means that
our passages are analogous to documents in another search engine.
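As a rough illustration of Boolean retrieval over an inverted index, the sketch below intersects the passage lists of all query words. The names InvertedIndex, addPosting, and booleanAnd are hypothetical, and std::map stands in for the index only to keep the example short; it is not a substitute for the hash table this assignment asks you to write.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <iterator>
#include <map>
#include <string>
#include <vector>

// Hypothetical stand-in for the index: word -> sorted list of passage ids.
// std::map is an assumption made for brevity, not the required hash table.
typedef std::map<std::string, std::vector<int> > InvertedIndex;

// Record that `word` occurs in passage `passageId`.
// Assumes passages are indexed in increasing id order, so every
// posting list stays sorted and free of duplicates.
void addPosting(InvertedIndex& index, const std::string& word, int passageId) {
    std::vector<int>& postings = index[word];
    assert(postings.empty() || postings.back() <= passageId);
    if (postings.empty() || postings.back() != passageId) {
        postings.push_back(passageId);
    }
}

// Boolean AND: return the ids of the passages that contain every query word.
// An empty query returns no passages.
std::vector<int> booleanAnd(const InvertedIndex& index,
                            const std::vector<std::string>& query) {
    std::vector<int> result;
    for (std::size_t i = 0; i < query.size(); ++i) {
        InvertedIndex::const_iterator it = index.find(query[i]);
        if (it == index.end()) {
            return std::vector<int>();  // an unknown word rules out every passage
        }
        if (i == 0) {
            result = it->second;
        } else {
            std::vector<int> common;
            std::set_intersection(result.begin(), result.end(),
                                  it->second.begin(), it->second.end(),
                                  std::back_inserter(common));
            result.swap(common);
        }
    }
    return result;
}
```

Keeping each list sorted is a design choice that lets std::set_intersection combine two lists in a single linear pass. Word frequency plays no part in the result, which matches the Boolean search described above.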
Assignment
- Use object-oriented programming to design the solution to this problem. You will submit a
UML diagram of your solution. This project will include several classes, including a
hash table. Apart from the hash table, you may use the STL for any other data structure
you wish to use, but you do not have to.
- Your program should expect a number of command-line arguments. The first
argument is the stoplist; it is followed by any number of documents, which you will index for
retrieval.
- To index the documents, you will divide them into passages of 100 words and store
the exact text including punctuation and the spaces or newlines where
they occur in the text. (You may use peek to find out what type of whitespace the
standard C++ input operator, >>, is about to skip.)
- You should hash each non-stopword and make a list of documents that include the
word. A word should not include punctuation or whitespace. The keys and values of your hash table should be templated types, and the hash function will be a data field in the hash class. The hash
function takes as parameters a key of type KeyType and an int that is the size of the hash table. This function will return an int, which is the index where the key hashes. You may use any collision resolution scheme you like; however, you must have one. We will discuss both hash functions and collision resolution schemes in class.
- Once you have created the inverted index and the document representation, print out a
welcome message that explains the program and asks the user to enter a query.
- You should print each passage that contains all the query words.
- Continue to ask for new queries until the user indicates that she is ready to quit.
- Use constants where appropriate.
- You may use either C++ or C style memory management and file I/O. Make sure that you are consistent. You should use dynamic memory allocation where appropriate to limit the amount of memory required to run your program.
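One possible shape for the hash table described above is sketched here. It assumes separate chaining as the collision resolution scheme and stores the hash function as a function-pointer data field. The names HashTable, insert, find, and stringHash are illustrative only, and the use of std::vector and std::list for the buckets is an assumption made to keep the sketch short; check whether that fits the rules for your own submission.

```cpp
#include <cassert>
#include <cstddef>
#include <list>
#include <string>
#include <utility>
#include <vector>

// Sketch of a hash table with templated key and value types.
template <typename KeyType, typename ValueType>
class HashTable {
private:
    typedef std::list<std::pair<KeyType, ValueType> > Bucket;

public:
    // Takes a key and the table size; returns the index where the key hashes.
    typedef int (*HashFunction)(const KeyType&, int);

    // A non-positive size falls back to a single bucket.
    HashTable(int size, HashFunction hash)
        : buckets_(static_cast<std::size_t>(size > 0 ? size : 1)), hash_(hash) {
        assert(hash != 0);
    }

    // Insert a new key, or overwrite the value of a key already present.
    void insert(const KeyType& key, const ValueType& value) {
        Bucket& bucket = buckets_[indexOf(key)];
        for (typename Bucket::iterator it = bucket.begin(); it != bucket.end(); ++it) {
            if (it->first == key) {
                it->second = value;
                return;
            }
        }
        bucket.push_back(std::make_pair(key, value));
    }

    // Return a pointer to the stored value, or a null pointer if the key is absent.
    ValueType* find(const KeyType& key) {
        Bucket& bucket = buckets_[indexOf(key)];
        for (typename Bucket::iterator it = bucket.begin(); it != bucket.end(); ++it) {
            if (it->first == key) {
                return &it->second;
            }
        }
        return 0;
    }

private:
    // Call the stored hash function and check that its result is a valid index.
    std::size_t indexOf(const KeyType& key) const {
        int index = hash_(key, static_cast<int>(buckets_.size()));
        assert(index >= 0 && index < static_cast<int>(buckets_.size()));
        return static_cast<std::size_t>(index);
    }

    std::vector<Bucket> buckets_;  // separate chaining: one list per slot
    HashFunction hash_;            // hash function kept as a data field
};

// One possible string hash (polynomial accumulation); assumes tableSize > 0.
int stringHash(const std::string& key, int tableSize) {
    unsigned long h = 0;
    for (std::size_t i = 0; i < key.size(); ++i) {
        h = h * 31 + static_cast<unsigned char>(key[i]);
    }
    return static_cast<int>(h % static_cast<unsigned long>(tableSize));
}
```

A table for the inverted index might then be declared as HashTable<std::string, std::vector<int> > index(1009, stringHash); where 1009 is an arbitrary example size. Open addressing would satisfy the collision resolution requirement equally well; chaining is shown only because it is short.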
Extra Credit
- (15 points) Keep track of the number of times each non-stopword occurs in a
passage, the total number of times a word occurs, and the number of passages you create.
Order the results of your search using the following formula:
where t is a word in the query Q,
nt is the number of times the word t occurs in the passages,
nd is the number of documents in the collection (i.e., the number of passages you
indexed), and
df is the number of documents (or passages) in the collection that contain the word t.
This formula is known as tf.idf (term frequency.inverse document frequency).
- (5 points) Create overlapping passages. Since we are using passages of 100 words,
each passage will consist of 50 words from the previous passage and 50 new words.
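The two extra-credit items can be sketched as follows. The scoring function assumes one common form of tf.idf, the sum over the query words of nt * log(nd / df), and takes nt to be the count of the word within the passage being scored; both are assumptions and may differ from the formula intended for this assignment. TermStats, tfIdfScore, and passageStarts are hypothetical names.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Statistics for one query word with respect to one passage (hypothetical struct).
struct TermStats {
    int nt;  // assumed: times the word occurs in the passage being scored
    int df;  // number of passages in the collection that contain the word
};

// Assumed tf.idf variant: sum over the query words of nt * log(nd / df),
// where nd is the number of passages indexed.
double tfIdfScore(const std::vector<TermStats>& queryTerms, int nd) {
    double score = 0.0;
    for (std::size_t i = 0; i < queryTerms.size(); ++i) {
        const TermStats& t = queryTerms[i];
        assert(t.df > 0 && t.df <= nd);
        score += t.nt * std::log(static_cast<double>(nd) / t.df);
    }
    return score;
}

// Start offsets, counted in words, of overlapping passages. Each passage
// begins passageSize / 2 words after the previous one, so it shares half of
// its words with its predecessor. The last passage may be shorter.
std::vector<std::size_t> passageStarts(std::size_t totalWords,
                                       std::size_t passageSize) {
    std::vector<std::size_t> starts;
    const std::size_t stride = passageSize / 2;
    if (totalWords == 0 || stride == 0) {
        return starts;
    }
    for (std::size_t start = 0; ; start += stride) {
        starts.push_back(start);
        if (start + passageSize >= totalWords) {
            break;  // this passage already reaches the end of the document
        }
    }
    return starts;
}
```

Under this variant, a word that appears in every passage contributes log(1) = 0 to the score, so only words that distinguish passages affect the ranking. For a 250-word document and 100-word passages, passageStarts yields the offsets 0, 50, 100, and 150.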
Data
In the /cs302/data directory there is a directory called docs. Use these documents to test your project. They are all news documents about the Monica Lewinsky scandal.
Submission
- Copy all code and your makefile to your dropbox.
- Print out your code and turn it in at the following class.
- Update your UML diagram to reflect your final submission and turn it in during
class.