CS 302 - Data Structures & Algorithms - Spring 2005
Programming Assignment 2



Passage Retrieval with a Templated Hash Table

Due

Wednesday, February 16th at 11:59pm.
Turn in a printout of the assignment on Friday, February 18th, at the beginning of class. Be sure to include the Honor Code statement:
"I hereby declare that I have abided by the Honor Code during this assignment."

Introduction

Information retrieval relies on hash tables to ensure that documents are retrieved as quickly as possible. However, hashing words is only one part of the solution to fast retrieval. Another is the information that is actually stored. Search engines build what are known as inverted indexes. An inverted index maps each vocabulary word, which is used as the key to the hash table, to the list of documents that contain that word. Generally, the number of times a query word occurs in a document is used to help determine the position of that document in the ranked list of documents. For this assignment, however, we will perform Boolean search: a document is retrieved if and only if it contains all of the query words, regardless of how often they occur. In addition, we will perform retrieval on passages of 100 words taken from the original documents, so our passages are analogous to the documents of an ordinary search engine.
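
As a rough illustration of the ideas above, the following sketch builds an inverted index and runs a Boolean (all words must match) query against it. The names InvertedIndex, addPassage, and booleanSearch are hypothetical, and std::map stands in for the hash table that the assignment asks you to write yourself; this is a sketch under those assumptions, not the required design.

```cpp
#include <cstddef>
#include <map>
#include <set>
#include <string>
#include <vector>

// Hypothetical type: maps each word to the ids of the passages containing it.
// std::map is a stand-in for the assignment's own templated hash table.
using InvertedIndex = std::map<std::string, std::set<int>>;

// Record that every word in `words` occurs in passage `id`.
void addPassage(InvertedIndex& index, int id,
                const std::vector<std::string>& words) {
    for (const std::string& word : words) {
        index[word].insert(id);
    }
}

// Return the ids of the passages that contain every word in `query`.
// An empty query matches nothing.
std::set<int> booleanSearch(const InvertedIndex& index,
                            const std::vector<std::string>& query) {
    std::set<int> result;
    for (std::size_t i = 0; i < query.size(); ++i) {
        InvertedIndex::const_iterator it = index.find(query[i]);
        if (it == index.end()) {
            return std::set<int>();  // an unseen word rules out every passage
        }
        if (i == 0) {
            result = it->second;     // start from the first word's passages
        } else {
            std::set<int> kept;      // keep only ids that also have this word
            for (int id : result) {
                if (it->second.count(id) > 0) {
                    kept.insert(id);
                }
            }
            result.swap(kept);
        }
    }
    return result;
}
```

Intersecting the per-word lists one query word at a time is what makes the search Boolean: a passage survives only if it appears in every list, and word frequency plays no role.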

Assignment

  1. Use object-oriented programming to design the solution to this problem. You will submit a UML diagram of your solution. This project will include several classes, including a hash table. Apart from the hash table, you may use the STL for any other data structure you wish, but you do not have to.
  2. Your program should expect a number of command-line arguments. The first argument is the stoplist; it is followed by any number of documents, which you will index for retrieval.
  3. To index the documents, divide them into passages of 100 words and store the exact text, including punctuation and the spaces or newlines where they occur in the text. (You may use peek to find out what type of whitespace the standard C++ input operator is about to skip.)
  4. You should hash each non-stopword and make a list of the documents that include the word. A word should not include punctuation or whitespace. The keys and values of your hash table should be templated types, and the hash function will be a data field of the hash table class. The hash function takes as parameters a key of type KeyType and an int that is the size of the hash table, and returns an int that is the index to which the key hashes. You may use any collision resolution scheme you like; however, you must have one. We will discuss both hash functions and collision resolution schemes in class.
  5. Once you have created the inverted index and the document representation, print out a welcome message that explains the program and asks the user to enter a query.
  6. You should print each passage that contains all the query words.
  7. Continue to ask for new queries until the user indicates that she is ready to quit.
  8. Use constants where appropriate.
  9. You may use either C++ or C style memory management and file I/O. Make sure that you are consistent. You should use dynamic memory allocation where appropriate to limit the amount of memory required to run your program.
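
To make item 4 more concrete, here is a minimal sketch of a hash table with templated key and value types whose hash function is stored as a data field. Everything named here (HashTable, HashFunction, hashString, find) is hypothetical. The sketch assumes separate chaining as the collision resolution scheme, passes the key by const reference, and uses std::vector and std::list for the buckets for brevity; your own design may differ on all of these points.

```cpp
#include <cstddef>
#include <list>
#include <string>
#include <utility>
#include <vector>

template <typename KeyType, typename ValueType>
class HashTable {
private:
    // One chain of (key, value) pairs per table slot (separate chaining).
    typedef std::list<std::pair<KeyType, ValueType> > Bucket;

public:
    // Shape described in item 4: (key, table size) -> index into the table.
    typedef int (*HashFunction)(const KeyType&, int);

    HashTable(int size, HashFunction hash) : buckets_(size), hash_(hash) {}

    // Return the value stored under `key`, inserting a default if absent.
    ValueType& operator[](const KeyType& key) {
        Bucket& bucket = buckets_[slot(key)];
        for (auto& entry : bucket) {
            if (entry.first == key) return entry.second;
        }
        bucket.push_back(std::make_pair(key, ValueType()));
        return bucket.back().second;
    }

    // Return a pointer to the value stored under `key`, or nullptr if absent.
    const ValueType* find(const KeyType& key) const {
        const Bucket& bucket = buckets_[slot(key)];
        for (const auto& entry : bucket) {
            if (entry.first == key) return &entry.second;
        }
        return nullptr;
    }

private:
    int slot(const KeyType& key) const {
        return hash_(key, static_cast<int>(buckets_.size()));
    }

    std::vector<Bucket> buckets_;
    HashFunction hash_;  // the hash function is a data field of the table
};

// One possible string hash (a simple polynomial hash; an assumption, not
// the function covered in class). tableSize must be positive.
int hashString(const std::string& key, int tableSize) {
    unsigned long h = 0;
    for (char c : key) {
        h = h * 31 + static_cast<unsigned char>(c);
    }
    return static_cast<int>(h % static_cast<unsigned long>(tableSize));
}
```

A table declared as HashTable<std::string, int> counts(101, hashString) could then be updated with counts["word"] += 1. Because the hash function is passed in rather than hard-coded, the same class works for any key type that supports ==.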

Extra Credit

  1. (15 points) Keep track of the number of times each non-stopword occurs in a passage, the total number of times a word occurs, and the number of passages you create. Order the results of your search using the following formula:
    where t is a word in the query Q,
    nt is the number of times the word t occurs in the passages
    nd is the number of documents in the collection (i.e. the number of passages you indexed), and
    df is the number of documents (or passages) in the collection that contain the word t.
    This formula is known as tf.idf (term frequency.inverse document frequency).
  2. (5 points) Create overlapping passages. Since we are using passages of 100 words, each passage will consist of 50 words from the previous passage and 50 new words.
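
The two extra-credit items can be sketched as follows. For item 1, the code assumes one common form of tf.idf that is consistent with the symbols defined there, namely score = sum over t in Q of nt * log(nd / df), with nt taken as the count of t in the passage being scored; treat that form as an assumption rather than as the official formula for the assignment. For item 2, passages are produced by a sliding window whose stride is half the passage length. The names makePassages, TermStats, and tfIdfScore are hypothetical.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <string>
#include <vector>

// Split `words` into passages of `length` words, starting a new passage
// every `stride` words. With length 100 and stride 50, each passage shares
// its first 50 words with the previous one. The last passage may be short.
std::vector<std::vector<std::string>> makePassages(
        const std::vector<std::string>& words,
        std::size_t length, std::size_t stride) {
    std::vector<std::vector<std::string>> passages;
    if (length == 0 || stride == 0) return passages;
    for (std::size_t start = 0; start < words.size(); start += stride) {
        std::size_t end = std::min(start + length, words.size());
        passages.emplace_back(words.begin() + start, words.begin() + end);
        if (end == words.size()) break;  // reached the end of the document
    }
    return passages;
}

// Statistics for one query word t with respect to one passage.
struct TermStats {
    int countInPassage;    // nt: occurrences of t in the passage being scored
    int passagesWithTerm;  // df: passages in the collection that contain t
};

// Assumed tf.idf form: sum over query words of nt * log(nd / df),
// where passageCount is nd. Words that occur nowhere contribute nothing.
double tfIdfScore(const std::vector<TermStats>& queryTerms, int passageCount) {
    double score = 0.0;
    for (const TermStats& t : queryTerms) {
        if (t.passagesWithTerm == 0) continue;
        score += t.countInPassage *
                 std::log(static_cast<double>(passageCount) / t.passagesWithTerm);
    }
    return score;
}
```

Under this form, a word that appears in every passage contributes log(1) = 0 to the score, so rare query words dominate the ranking. Sorting the matching passages by descending score would give the ordered result list.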

Data

In the /cs302/data directory there is a directory called docs. Use these documents to test your project. They are all news documents about the Monica Lewinsky scandal.

Submission