Search Engine Construction

Skip to end of metadata
Go to start of metadata

Due: Wed, Nov 2, 2011

Goals and requirements

Your goal in this project is to construct a search engine for a static corpus, namly the Enron e-mail collection. Due to disk space limitations on our small Amazon machines, however, I will give you a subset of the complete e-mail dump: 100 e-mail messages for each of the 150 mail addresses.

You should create index in memory (dictionary and posting lists), but store the e-mail files on disk. Set up your Web server configuration so that the files are available visible with the URI structure mirroring the disk structure (except for the mistake I made with the extra level of directory; e.g., "1./1." instead of "1."). For example, we want URI /enron/motley-m/inbox/16 for file

enron_mail_20110402/maildir/motley-m/inbox/16./16.

Note the removal of the '.' on the end of the filename in the URI. The easiest way to do this is simply remove that '.' from each filename on disk.

You are free to use any of the Python code we built on page TFIDF.

Assuming a free text search, not a Boolean AND or OR condition in the terms. In other words, it's possible that documents may come up that do not include all of the search terms.

We will, in principle, be using a cosine similarity between query and documents except that it will look like a simple scoring by "sum of weights" because our queries are short: we will use 0 or 1 as the weight for query terms. So for query "burn California burn" the resulting vector weights are 1 for terms burn and California. There would be an implicit zero in all other positions.

You can use either a binary tree (b-tree) or hash table to create the dictionary component and I recommend variable length arrays for the postings lists. You will also need an array that maps document ID to filename (or URI). I would not recommend storing the posting lists on the disk since everything should fit in memory for our small data size. Certainly if you try to store one posting list for file, you will run out of inodes on our small 8G-disk machines.

You should make search queries as efficient as possible and return the most relevant e-mails to the top of the list. That means using a heap as a priority queue to get the top K search results. Create a heap from the results and then get the top results in the proper order. See Python module heapq. Or, you could simply brute force sort the search results by score.

Because it will take a while to construct the index, create a separate program that constructs and persists your index to disk with cpickle and reload upon server start up.

If you use heuristics to increase the speed of your queries, you must document this on a separate sheet so I don't miss it as a comment in your code. Heuristics can alter the order or set of documents that appear in the results in exchange for increased speed.

Example

Here is email file "16." within Motley's directory:

~/USF/CS680/data/enron_mail_20110402/maildir/motley-m/inbox $ cat 16.
Message-ID: <8959521.1075841336562.JavaMail.evans@thyme>
Date: Fri, 15 Mar 2002 12:50:27 -0800 (PST)
From: virginia.thompson@enron.com
To: matt.motley@enron.com
Subject: phone number for Malowney
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
X-From: Thompson, Virginia </O=ENRON/OU=NA/CN=RECIPIENTS/CN=VTHOMPSO>
X-To: Motley, Matt </O=ENRON/OU=NA/CN=RECIPIENTS/CN=Mmotley>
X-cc:
X-bcc:
X-Folder: \ExMerge - Motley, Matt\Inbox
X-Origin: MOTLEY-M
X-FileName: matt motley 6-26-02.pst

Matt-

  I talked to John Malowney today (Friday, March 15) and he asked if I would
 send you an e-mail and ask you to call him on Tuesday (March 19) at (503)
 833-4526.

  Thanks,

  Virginia

A search results for this should look like:

phone number for Malowney
From: virginia.thompson@enron.com
To: matt.motley@enron.com
Matt- I talked to John Malowney today (Friday, March 15) and he asked if I would send you...

Strip out newlines, combined spaces, and display the first, say, 80 characters of the e-mail message.

Server

Because there is no persistence across wsgi calls using Apache, we cannot use it. One student suggested gevent, but there are lots more to choose from. You are all free to use the server of your choice as long as it can do wsgi using a shared Python VM across wsgi calls (otherwise you would have to create index every time you made a URI reference).

Deliverables

You must provide a URI that return search results:

Show the top 20 results in the browser and show a "next page" link that will move to the next 20 results etc...

The results in the browser should look something like google results where the link has the Subject line extracted from the e-mail, the from address, the to address, and the first 80 characters of the file (strip whitespace so it looks nice). The link should go to the URI for the appropriate e-mail file for which you also must provide functionality:

As usual, I will need a print out of your Python code. There will be no late projects accepted nor even late print outs. I will grade whatever is available at the start of class.

Labels: