A search engine builds what is known as an inverted index: for each vocabulary word, a list of the documents that contain that word. The search engine can then quickly retrieve all documents that contain a word by looking the word up in the inverted index. Given a multi-word query, the search engine can intersect the retrieved document lists to find the set of documents that include all words from the query. (Generally, the frequency of the query words in a document is used to help determine the position of that document in a ranked list of results.)
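The intersection step can be sketched as follows. This is only an illustration under stated assumptions: each word's documents are kept as an ascending array of integer document indices, and the name intersect_docs is hypothetical, not part of the assignment.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: intersect two ascending arrays of document
 * indices. The caller must give out room for min(na, nb) entries.
 * Returns the number of indices written to out. */
size_t intersect_docs(const int *a, size_t na,
                      const int *b, size_t nb, int *out)
{
    size_t i = 0, j = 0, n = 0;

    assert(a != NULL && b != NULL && out != NULL);

    /* Walk both lists in step, always advancing the smaller side. */
    while (i < na && j < nb) {
        if (a[i] < b[j]) {
            i++;
        } else if (a[i] > b[j]) {
            j++;
        } else {
            /* The document appears in both lists. */
            out[n++] = a[i];
            i++;
            j++;
        }
    }
    return n;
}
```

For a query with more than two words, the result of one intersection can be intersected with the list for the next word, and so on. If the lists are not kept sorted, a different approach (for example, marking documents in a per-query array) would be needed.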
To reduce the size of the search space, a search engine also employs a stop-list: a list of words nobody cares about. For example, the English words "the" and "is" normally do not help identify a document.
The input to your program includes, as command-line arguments, a file name for the stop-list followed by a list of document file names. The rest of the input consists of queries read from standard input. From the document files, your program should first build an inverted index and then accept and respond to queries. The output for a query, printed to standard output, includes the file names and optionally the contents of all the documents that include all the non-stop-list words from the query.
Your program must store the inverted index as a hash table of pairs, where each pair contains a word and the collection of documents containing that word. (This hash table is essentially a mapping from words to the documents that contain them.) Your program must also store the stop-list words in a (perhaps different) hash table. The stop-list file includes one entry per line.
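One possible layout for such a table is sketched below. It is not a required design: the chaining scheme, the fixed bucket count, the djb2 hash function, the use of integer document indices, and the names struct entry, struct index, index_lookup, and index_add are all assumptions made for illustration.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define NUM_BUCKETS 1024  /* assumed table size, not from the assignment */

/* One (word, documents) pair, chained within its bucket. */
struct entry {
    char *word;
    int *docs;            /* indices into the list of document file names */
    size_t num_docs;
    size_t cap_docs;
    struct entry *next;
};

struct index {
    struct entry *buckets[NUM_BUCKETS];
};

/* djb2 string hash: one common choice among many. */
static unsigned long hash_word(const char *s)
{
    unsigned long h = 5381;
    while (*s != '\0')
        h = h * 33 + (unsigned char)*s++;
    return h;
}

/* Return the entry for word, or NULL if the word is not in the table. */
struct entry *index_lookup(const struct index *idx, const char *word)
{
    struct entry *e = idx->buckets[hash_word(word) % NUM_BUCKETS];
    while (e != NULL && strcmp(e->word, word) != 0)
        e = e->next;
    return e;
}

/* Record that document doc contains word.
 * Returns 0 on success, -1 if memory allocation fails. */
int index_add(struct index *idx, const char *word, int doc)
{
    struct entry *e;

    assert(idx != NULL && word != NULL);
    e = index_lookup(idx, word);
    if (e == NULL) {
        unsigned long b = hash_word(word) % NUM_BUCKETS;
        e = malloc(sizeof *e);
        if (e == NULL)
            return -1;
        e->word = malloc(strlen(word) + 1);
        if (e->word == NULL) {
            free(e);
            return -1;
        }
        strcpy(e->word, word);
        e->docs = NULL;
        e->num_docs = 0;
        e->cap_docs = 0;
        e->next = idx->buckets[b];
        idx->buckets[b] = e;
    }
    /* Assumes documents are indexed one at a time, so a repeated
     * (word, doc) pair can only match the most recent entry. */
    if (e->num_docs > 0 && e->docs[e->num_docs - 1] == doc)
        return 0;
    if (e->num_docs == e->cap_docs) {
        size_t cap = e->cap_docs ? e->cap_docs * 2 : 4;
        int *p = realloc(e->docs, cap * sizeof *p);
        if (p == NULL)
            return -1;
        e->docs = p;
        e->cap_docs = cap;
    }
    e->docs[e->num_docs++] = doc;
    return 0;
}
```

A table declared with static storage (or otherwise zero-initialized) starts with every bucket empty. The same structure could also hold the stop-list, in which case the docs array would simply go unused and index_lookup would serve as a membership test.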
While punctuation should be maintained in the returned documents, it should not be included when comparing words. For example, if asked "Does the previous sentence include `words'?", the answer is yes although the actual string is `words.'. Words include only upper and lower case letters (see isalpha[3]).
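The word rule above might be implemented along these lines. This is a sketch, not a prescribed interface: the function name next_word and its parameters are hypothetical, and it treats a word as a maximal run of characters for which isalpha returns true.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper: copy the next word (a maximal run of letters)
 * found at or after text[*pos] into out, then advance *pos past it.
 * Returns 1 if a word was found and 0 at the end of the text.
 * Words longer than out_size - 1 letters are truncated. */
int next_word(const char *text, size_t *pos, char *out, size_t out_size)
{
    size_t i, n = 0;
    int found = 0;

    assert(text != NULL && pos != NULL && out != NULL && out_size > 0);
    i = *pos;

    /* Skip punctuation, digits, and whitespace before the word. */
    while (text[i] != '\0' && !isalpha((unsigned char)text[i]))
        i++;

    /* Copy letters until the first non-letter. */
    while (text[i] != '\0' && isalpha((unsigned char)text[i])) {
        found = 1;
        if (n + 1 < out_size)
            out[n++] = text[i];
        i++;
    }

    out[n] = '\0';
    *pos = i;
    return found;
}
```

Calling next_word repeatedly on the same text with the same pos variable yields each word in turn, so the string `words.' produces the word "words". The cast to unsigned char is needed because passing a negative char value to isalpha is undefined behavior. Under this rule an apostrophe ends a word, so "don't" would be split into "don" and "t"; case is left unchanged here because the text above does not say whether comparisons should ignore case.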
a2ps -q -Avirtual -2 -o mycode.ps <files>
Preview the page breaks and general appearance using gv mycode.ps.
print_document_name(document);
#ifdef INCLUDE_DOCS
print_document_contents(document);
#endif
<your-program> stop-list docs/* < queries1