CS 302 - Data Structures & Algorithms - Spring 2008
Programming Assignment 3



R is for Retrieval

(or let's play with hash tables)


Due

Design: Monday, February 25, 2008
Project: Monday, March 17, 2008 (extended from Wednesday, March 12, 2008) at the beginning of class

Background

A search engine builds what is known as an inverted index: a list of documents that contain a specific vocabulary word. The search engine can then quickly retrieve all documents that contain a word using the inverted index. Given a multi-word query, the search engine can intersect the retrieved documents to find the set of documents that includes all words from the query. (Generally, the frequency of the query words in a document is used to help determine the position of a document in a ranked list of documents.)
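The intersection step for a multi-word query can be sketched as follows. This is only an illustration, assuming each word's document collection is kept as an ascending array of integer document ids; the function name intersect_sorted and that representation are assumptions, not part of the assignment.

```c
#include <assert.h>
#include <stddef.h>

/* Intersect two ascending arrays of document ids (hypothetical
 * representation). Writes the ids common to both into out, which must
 * have room for min(na, nb) entries, and returns how many were written. */
size_t intersect_sorted(const int *a, size_t na,
                        const int *b, size_t nb, int *out)
{
    size_t i = 0, j = 0, n = 0;

    assert(out != NULL);
    while (i < na && j < nb) {
        if (a[i] < b[j]) {
            i++;                 /* a[i] cannot appear in b */
        } else if (a[i] > b[j]) {
            j++;                 /* b[j] cannot appear in a */
        } else {
            out[n++] = a[i];     /* id present in both lists */
            i++;
            j++;
        }
    }
    return n;
}
```

For a query with more than two words, the result of one intersection can be intersected with the next word's list in turn.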

To reduce the size of the search space, a search engine also employs a stop-list: a list of words nobody cares about. For example, the English words "the" and "is" normally do not help identify a document.

The Assignment

The input to your program includes, as command-line arguments, a file name for the stop-list followed by a list of document file names. The rest of the input is queries read from standard input. From the document files, your program should first build an inverted index and then accept and respond to queries. The output for a query, printed to standard output, includes the file names and optionally the contents of all the documents that include all the non-stop-list words from the query.

Your program must store the inverted index as a hash table of pairs that contains the word and the collection of documents containing that word. (This hash table is essentially a mapping from words to the documents that contain that word.) Your program must also store the stop-list words in a (perhaps different) hash table. The stop-list file includes one entry per line.
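One possible shape for such a table is sketched below, using separate chaining for collisions. Everything named here (struct inv_index, index_add, index_find, NBUCKETS, MAX_DOCS, and the per-document flag array) is a hypothetical choice made for illustration; any collision resolution scheme and any document collection would do.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define NBUCKETS 1024   /* hypothetical table size */
#define MAX_DOCS 64     /* hypothetical cap on the number of documents */

struct entry {
    char *word;                      /* key */
    unsigned char in_doc[MAX_DOCS];  /* in_doc[d] != 0 if document d has word */
    struct entry *next;              /* next entry in the same bucket */
};

struct inv_index {
    struct entry *bucket[NBUCKETS];  /* must start out all NULL */
};

/* djb2-style string hash, reduced to a bucket number. */
static unsigned long hash_word(const char *s)
{
    unsigned long h = 5381;
    while (*s)
        h = h * 33 + (unsigned char)*s++;
    return h % NBUCKETS;
}

/* Return the entry for word, or NULL if the word is not in the table. */
struct entry *index_find(const struct inv_index *ix, const char *word)
{
    struct entry *e;

    assert(ix != NULL && word != NULL);
    for (e = ix->bucket[hash_word(word)]; e != NULL; e = e->next)
        if (strcmp(e->word, word) == 0)
            return e;
    return NULL;
}

/* Record that document doc contains word. Returns 0 on success, -1 on
 * a bad document number or allocation failure. */
int index_add(struct inv_index *ix, const char *word, int doc)
{
    struct entry *e;
    unsigned long h;

    if (doc < 0 || doc >= MAX_DOCS)
        return -1;
    e = index_find(ix, word);
    if (e == NULL) {                 /* first time this word is seen */
        e = calloc(1, sizeof *e);
        if (e == NULL)
            return -1;
        e->word = malloc(strlen(word) + 1);
        if (e->word == NULL) {
            free(e);
            return -1;
        }
        strcpy(e->word, word);
        h = hash_word(word);
        e->next = ix->bucket[h];     /* push onto the bucket's chain */
        ix->bucket[h] = e;
    }
    e->in_doc[doc] = 1;
    return 0;
}
```

A stop-list table could follow the same pattern with the in_doc field dropped, since only membership matters there.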

While punctuation should be maintained in the returned documents, it should not be included when comparing words. For example, if asked "Does the previous sentence include `words'?", the answer is yes although the actual string is `words.'. Words include only upper and lower case letters (see isalpha[3]).
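A minimal sketch of this word rule, treating a word as a maximal run of letters and skipping everything else. The function name next_word and its interface are assumptions made for illustration. The sketch keeps letter case as is, since the assignment does not say whether comparisons ignore case.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Copy the next word (a maximal run of letters) from *text into buf,
 * keeping at most bufsize - 1 letters and always NUL-terminating, then
 * advance *text past the word. Returns 1 if a word was found, 0 if the
 * end of the text was reached first. */
int next_word(const char **text, char *buf, size_t bufsize)
{
    const char *p = *text;
    size_t n = 0;
    int found = 0;

    assert(bufsize > 0);
    /* Skip punctuation, digits, and white space before the word. */
    while (*p != '\0' && !isalpha((unsigned char)*p))
        p++;
    /* Collect letters; extra letters of an over-long word are dropped. */
    while (isalpha((unsigned char)*p)) {
        found = 1;
        if (n + 1 < bufsize)
            buf[n++] = *p;
        p++;
    }
    buf[n] = '\0';
    *text = p;
    return found;
}
```

Called repeatedly on the text include `words.'? this yields "include" and then "words", so the stored document can keep its punctuation while the index sees only letters.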

What to hand in

  1. Design: (by 2/25) A one to two page program design.
  2. (The rest is due on 3/17, extended from 3/12.) A well-formatted 2-up listing of your code. You might use something like
        a2ps -q -Avirtual -2 -o mycode.ps  <files>
    
    Preview the page breaks and general appearance using gv mycode.ps.
  3. A black-box test plan (remember in-out-rationale!) and the output from the execution of your program on this test plan. Include a 2-up printout of the test plan executed on the p3-names-only version with your code printout.
  4. Your graded design.
  5. Two executables (runnable by me), named p3-outputs-docs and p3-names-only, placed in your home directory.
  6. Email a gzipped (gzip[1]) tar file named <your-name>.tar.gz that includes your source, makefile, test plan (named <your-name>-test-plan), the captured output of your test plan running on your code (named <your-name>-out; i.e., the output of script[1]), and any auxiliary files that you have accumulated. Please don't include the test documents!

Notes

  1. Include C pre-processor directives so that gcc -DINCLUDE_DOCS ... builds a program that outputs file names and the optional contents. Without the option -DINCLUDE_DOCS, only the file names are printed. The code will look something like
        print_document_name(document);
      #ifdef INCLUDE_DOCS 
        print_document_contents(document);
      #endif
    
  2. Your data structures must include at least two hash tables.
  3. You may use any hash-table collision resolution scheme you like; however, you must have one.
  4. You can assume the size of a file will be less than some constant. My program includes
    #define MAX_FILE_LENGTH 2048
  5. Speed matters: 7, 4, and 1 bonus points go to the top three fastest search engines.
    Plus! an additional 10 bonus points for beating Dr. Binkley's blindingly fast program :)
  6. Tests and the stop-list exist in /cs302/proj3-data. For example, from there, run
        <your-program> stop-list docs/* < queries1