Week 12 (4/25-5/1)


Weekly digest

Text classification with Naive Bayes

  • Text processing

  • Naive Bayes classification

  • Laplace smoothing

  • Word clouds



1. Word counts


2. Stop words



Exercise 1

Use the file movie_reviews.zip to create a dataframe with two columns, “negative” and “positive”. The rows of the dataframe should be labeled by the words appearing in the text of the reviews. The entries in each row should show how many times the given word appears in the negative reviews and how many times it appears in the positive reviews.


  1. A word is a sequence of letters, possibly including apostrophes. Thus hello and don’t are words.

  2. Capitalization of words should be ignored. That is, we consider hello, Hello and HELLO as the same word.

          negative  positive
this         20294     17756
show          1390      1791
proved          49        76
to           34325     33630
be            7241      6262
a            39146     42334
waste          674        50
of           34273     38891
minutes       1099       426
precious        27        24
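
The counting step can be sketched as follows. This is a minimal illustration on two made-up reviews, not a solution for movie_reviews.zip: the names tokenize, count_words and toy_reviews are hypothetical, and reading the zip archive is left out because its internal layout is not described here. The sketch assumes that apostrophes appear as the plain ' character.

```python
import re
from collections import Counter

# A word is a run of letters and apostrophes; the text is lowercased first,
# so hello, Hello and HELLO are counted as the same word.
WORD_RE = re.compile(r"[a-z']+")


def tokenize(text):
    """Return the list of lowercase words found in text."""
    return WORD_RE.findall(text.lower())


def count_words(reviews):
    """Count words per label.

    reviews is an iterable of (label, text) pairs, where label is
    "negative" or "positive" (a hypothetical input format).
    """
    counts = {"negative": Counter(), "positive": Counter()}
    for label, text in reviews:
        counts[label].update(tokenize(text))
    return counts


# Made-up data, used only to show the shape of the result.
toy_reviews = [
    ("negative", "A waste of time. Don't watch this show."),
    ("positive", "This show is GREAT, don't miss it!"),
]
counts = count_words(toy_reviews)
print(counts["negative"]["waste"], counts["positive"]["waste"])
```

The two counters can then be turned into a dataframe, for example with pandas.DataFrame(counts).fillna(0).astype(int), which should give one row per word and one column per label. Note that this pattern also picks up HTML residue such as the br in <br /> tags, so some cleanup of the review text may be needed first.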

Exercise 2

Write a function rev_probs() which takes as its argument the text of a review and returns the logarithms of the probabilities that the review is positive and that it is negative, based on the training data. The function should use naive Bayes with Laplace smoothing to compute the probabilities.


# sample review

review = """I saw this recent Woody Allen film because I\'m a fan of
his work and I make it a point to try to see everything he does, though
the reviews of this film led me to expect a disappointing effort. They were right.
This is a confused movie that can\'t decide whether it wants to be a comedy,
a romantic fantasy, or a drama about female mid-life crisis. It fails at all three.
<br /><br />Alice (Mia Farrow) is a restless middle aged woman who has married into
great wealth and leads a life of aimless luxury with her rather boring husband and
their two small children. This rather mundane plot concept is livened up with such
implausibilities as an old Chinese folk healer who makes her invisible with some magic
herbs, and the ghost of a former lover (with whom she flies over Manhattan). If these
additions sound too fantastic for you, how about something more prosaic, like an affair
with a saxophone player?<br /><br />I was never quite sure of what this mixed up muddle
was trying to say. There are only a handful of truly funny moments in the film,
and the endingis a really preposterous touch of Pollyanna.<br /><br />Rent \'Crimes and
Misdemeanors\' instead, a superbly well-done film that suceeds in combining comedy with
a serious consideration of ethics and morals. Or go back to "Annie Hall" or "Manhattan"."""
negative   -427.788529
positive   -429.078611
dtype: float64
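
One possible shape for such a function is sketched below on a tiny made-up training set. Several things here are assumptions rather than part of the exercise: the helper names tokenize and train, the extra arguments counts, n_docs and alpha passed to rev_probs() (the exercise asks for a single argument), the choice to return a plain dict instead of a pandas Series, and the choice to keep words that never occur in the training data. The scores are logarithms of unnormalized probabilities, log P(class) plus the sum of log P(word | class), so only their comparison is meaningful.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Return the lowercase words (letters and apostrophes) in text."""
    return re.findall(r"[a-z']+", text.lower())


def train(reviews):
    """Collect per-class word counts and per-class review counts."""
    counts = {"negative": Counter(), "positive": Counter()}
    n_docs = Counter()
    for label, text in reviews:
        counts[label].update(tokenize(text))
        n_docs[label] += 1
    return counts, n_docs


def rev_probs(text, counts, n_docs, alpha=1.0):
    """Return {label: log score} using naive Bayes with Laplace smoothing.

    P(word | class) is estimated as
        (count of word in class + alpha) / (words in class + alpha * |V|),
    where V is the vocabulary of the training data.
    """
    vocab = set().union(*counts.values())
    total_docs = sum(n_docs.values())
    words = tokenize(text)
    scores = {}
    for label, word_counts in counts.items():
        total_words = sum(word_counts.values())
        denom = total_words + alpha * len(vocab)
        # Start from the log prior of the class.
        score = math.log(n_docs[label] / total_docs)
        for word in words:
            # A Counter returns 0 for unseen words, so smoothing applies.
            score += math.log((word_counts[word] + alpha) / denom)
        scores[label] = score
    return scores


# Made-up training data, for illustration only.
toy_training = [
    ("negative", "a waste of time, boring show"),
    ("negative", "boring and bad"),
    ("positive", "a great show, loved it"),
    ("positive", "great fun"),
]
counts, n_docs = train(toy_training)
scores = rev_probs("What a waste", counts, n_docs)
print(max(scores, key=scores.get))
```

Working with sums of logarithms rather than products of probabilities avoids numerical underflow on long reviews, which is why the results are large negative numbers.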