Sentiment Classification with SVMs

Solution Outline

Our problem required us to classify movie reviews into two categories: positive and negative. Each movie review was a text file containing user comments about a particular movie. To accomplish this, we used a Support Vector Machine (SVM), specifically Thorsten Joachims' SVMlight package, with all default parameters and a linear kernel. Each movie review was fed into the SVM as a feature vector in which each word was represented as a single feature of the review document. To improve accuracy, we tried different representations of the features: presence representations, with a 1 for the presence of a word and a 0 for its absence, and frequency representations, with numerical weights corresponding to the frequency of each word in the document. We also tried different feature extraction methods, such as using only nouns and pronouns, adjectives, adverbs and verbs, and features that contained only letters and numbers. In all cases, we removed stop words: common English words that contribute little meaning to a sentence. The list of stop words we used can be found here. We also ran the data through a Naive Bayes classifier to compare its performance with that of the SVM.
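
The presence and frequency representations can be illustrated with a minimal Python sketch that turns a review into one line of SVMlight's sparse input format (a target label followed by index:value pairs in increasing index order). This is not the code used in the project. The stop-word set, the function names, the tokenization rule (lowercased runs of letters and digits), and the use of raw counts as frequency weights are all assumptions made for illustration.

```python
import re
from collections import Counter

# Hypothetical stop-word list for illustration only; the list actually
# used in the project is not reproduced here.
STOP_WORDS = {"the", "a", "an", "and", "of", "is", "it", "was", "this", "to"}


def tokenize(text):
    """Lowercase the text, keep runs of letters and digits, drop stop words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]


def build_vocabulary(documents):
    """Map each distinct token to a feature index, starting at 1."""
    vocab = {}
    for text in documents:
        for token in tokenize(text):
            if token not in vocab:
                vocab[token] = len(vocab) + 1
    return vocab


def to_svmlight_line(text, label, vocab, representation="presence"):
    """Format one review as '<label> <index>:<value> ...'.

    representation="presence"  -> value is 1 for every word that occurs
    representation="frequency" -> value is the raw count of the word
                                  (an assumed weighting scheme)
    """
    counts = Counter(tokenize(text))
    features = {}
    for token, count in counts.items():
        index = vocab.get(token)
        if index is None:
            continue  # word not seen when the vocabulary was built
        features[index] = 1 if representation == "presence" else count
    # Emit the pairs sorted by increasing feature index.
    pairs = " ".join(f"{i}:{v}" for i, v in sorted(features.items()))
    return f"{label:+d} {pairs}"


reviews = ["A great, great movie", "This was a boring movie"]
vocab = build_vocabulary(reviews)
print(to_svmlight_line(reviews[0], +1, vocab, "presence"))
print(to_svmlight_line(reviews[0], +1, vocab, "frequency"))
print(to_svmlight_line(reviews[1], -1, vocab, "presence"))
```

Lines produced this way, one per review, could be written to a training file and passed to SVMlight's svm_learn program. The two representations differ only in the value stored for each feature, so the same vocabulary serves both.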