Due: Dec 5, 2011 (2nd to last day of class)
Goals
Your goal in this project is to learn about unsupervised and supervised classification. You will do both k-means clustering and naive Bayes classifiers. We will be using a data set from Reuters that is a set of newswire articles. First, you will perform 4-means clustering to see how the clustering algorithm does compared to the known categorization from Reuters. Then, you will train a naive Bayes classifier to distinguish between the 4 categories (see below for the categories).
Clustering. After you have performed clustering, you should measure the difference between your categorization and the Reuters <code code="..."> categorization. There will likely be a difference in the categorization and it is not necessarily an error in your clustering program. But, I want your analysis of the difference in your report.
Classifier. Given an unknown document, you must determine which category it belongs in. Most of the details are left up to you and are part of the learning experience.
You must provide a URI as a REST service that returns a category name in response to a POST request equivalent to the following URI:
/classify?xml=foo
where foo is XML data file content from another Reuters file that your classifier has not seen. The only difference, of course, is that we will not include the metadata tags that indicate the categorization. We will send the xml data to your server and you return either GPOL, GCRIM, GSPO, or GDEF. And I mean literally those 4 or 5 characters. That is to make it easier for us to grade.
The training Data
The data is in XML format and looks like this:
<?xml version="1.0" encoding="iso-8859-1"?> <newsitem itemid="4433" id="root" date="1996-08-20" xml:lang="en"> <title>SWITZERLAND: India confirms nuclear test ban veto.</title> <headline>India confirms nuclear test ban veto.</headline> <byline>Philippe Naughton</byline> <dateline>GENEVA 1996-08-20</dateline> <text> <p>India told the Conference on Disarmament on Tuesday that it could still not accept a draft nuclear text ban treaty, preventing its formal adoption in Geneva after more than two years of negotiations.</p> ... </text> ... <metadata> <codes class="bip:topics:1.0"> <code code="GDIP">...</code> ... </codes> </metadata> ... </newsitem>
You can only use title, text for training as that is what we will use for testing.
The data dir zip file is
/home/public/cs680/reuters-4-categories.tar.gz
|
The data should not be made public as I had to sign a license to get access to the data. Please do not put this on any public websites or pass around to anybody outside of school. |
If you find a data file that is categorized into 2 or more of our categories, then you can delete them. I have already deleted about 100 files from the GPOL category that were also categorized as one of the others. Please let me know if there is significant overlap in the other categories.
You will have to figure out how many files to keep out for testing and how many to use for training.
File selection
After looking at the data, I decided to pick 4 categories containing total 1380 files. The categories are:
GPOL DOMESTIC POLITICS 101 files GCRIM CRIME, LAW ENFORCEMENT 336 GSPO SPORTS 387 GDEF DEFENCE 564
The sampling of files histogram can be easily obtained with the following magic incantation of the commandline (this is on a subset of our data):
$ grep code *.xml | grep -v /code|grep -v class | awk '{print $3}' \
|sort |uniq -c |sort -r -n
1210 code="CCAT"
925 code="USA"
640 code="MCAT"
622 code="GCAT"
495 code="C15"
452 code="ECAT"
304 code="C151"
296 code="M14"
246 code="UK"
221 code="C152"
209 code="E21"
181 code="M13"
174 code="M141"
...
171 code="GPOL"> ## our category
...
100 code="GCRIM"> ##
...
48 code="GSPO"> ##
...
29 code="GDEF"> ##
...
In order to find the files that contain the various categories, obviously you can do the following on the commandline:
$ grep -l 'code="GDEF"' *.xml 2932newsML.xml 3110newsML.xml 3117newsML.xml 3120newsML.xml 3141newsML.xml ...
Extracting data
Python makes it very easy to extract the data. We need title, text from each file of interest. For example, here is how to get the title text:
Feature selection
The number of unique words will be extremely large and so you can't use the entire vocabulary as your feature vector. Part of the project will be figuring out which words to use in your feature vector. I imagine it will still be fairly large. You need to decide what this feature vector is for both clustering and Bayesian classification, though you are free to use different feature vectors for both. You're encouraged to experiment and to learn new techniques. For example, can you use principal components decomposition to reduce the feature vector here? Do you need to do Additive smoothing? Do you need to normalize word counts as the features or can you use binary word presence or absence.
There are lots of things you can tweak in your classifiers to get better or worse classification. Part of your job in this project is to learn about all the details necessary to move from we talk about in the lecture to building a decent system.
Extra credit (10%)
Build a k nearest neighbor classifier in addition to the Bayesian classifier. To get full credit for this component, you must provide a report that describes the accuracy difference between the 2 different classifiers. You should try different values of k to see how it affects accuracy.
Deliverables
- Code to perform clustering, cluster.py, on the set of all xml data files. The program can generate whatever output you want, but you must analyze your results for your report.
- You should turn in a short document that describes the details of your clustering approach and how well you were able to categorize the documents as compared to what Reuters categorized them as. Note I am not looking for you to alter your algorithm until it matches what Reuters has done. They could be very different, which would be very interesting and I would like to hear about it. You can describe all sorts of interesting things about the categorization that your clustering algorithm resulted in. For example, do you think it is a good categorization? Is it better or worse than what Reuters is done by hand? If you compute Fechter means for the categories, are they fairly well separated? you get the idea...
- POST URI: /classify with parameter
xmlcontaining the data to classify. Return the category. - You should turn in a short document that describes the details of your approach for classification and discuss the accuracy of your classifier. There are a number of "goodness" measurements you can make. You can describe what you learned and anything else that you feel is relevant. For example, did you learn anything from clustering that helped you build your classifier?
- If you are doing the extra credit component, you will need a report that analyzes your work and compares to the other techniques.
For the classifier, we will run each student's classifier on a dataset of our choosing and compute the following metric: accuracy - percentage of false positives. We will compare this metric to the other students in the class to arrive at grades for this component.