Project: Text classification with naive Bayes

Project weight: 10 points

In this project you will be working with the following two files:

The first file is a zipped CSV file containing the texts of about 25,000 movie reviews. Each review is accompanied by a label indicating whether the review is positive or negative.

The second file is a zipped text file with about 7,000 posts submitted to eight online newsgroups devoted to different topics: cars, hockey, religion, etc. Each post includes information on which newsgroup it was submitted to.


Check how well a naive Bayes classifier can predict whether a movie review is positive or negative based on the text of the review. Likewise, check how well naive Bayes predicts the newsgroup to which a post was submitted based on the text of the post. Describe your results.
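As a starting point, here is a minimal sketch of one common variant, a multinomial naive Bayes classifier with add-one (Laplace) smoothing, written with standard Python libraries only. The function names (tokenize, train_nb, predict_log_scores, predict), the smoothing parameter alpha and the toy reviews are all assumptions made for illustration; they are not part of the project data, and your own design may differ.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def train_nb(texts, labels):
    """Collect per-class document counts, per-class word counts and the vocabulary."""
    class_docs = Counter(labels)
    word_counts = defaultdict(Counter)
    for text, label in zip(texts, labels):
        word_counts[label].update(tokenize(text))
    vocab = {w for counts in word_counts.values() for w in counts}
    return class_docs, word_counts, vocab


def predict_log_scores(text, model, alpha=1.0):
    """Return log P(class) + sum of log P(word | class) for every class."""
    class_docs, word_counts, vocab = model
    n_docs = sum(class_docs.values())
    # Words never seen in training are ignored (an assumption of this sketch).
    tokens = [t for t in tokenize(text) if t in vocab]
    scores = {}
    for label, n in class_docs.items():
        total = sum(word_counts[label].values())
        denom = total + alpha * len(vocab)
        score = math.log(n / n_docs)
        for t in tokens:
            # Smoothing keeps the probability of an unseen word above zero.
            score += math.log((word_counts[label][t] + alpha) / denom)
        scores[label] = score
    return scores


def predict(text, model):
    """Return the class with the highest log score."""
    scores = predict_log_scores(text, model)
    return max(scores, key=scores.get)


# Toy data, invented for illustration only.
toy_texts = ["a great and moving film", "great acting great story",
             "a dull and boring film", "boring plot terrible acting"]
toy_labels = ["pos", "pos", "neg", "neg"]
model = train_nb(toy_texts, toy_labels)
print(predict("great story", model))
print(predict("boring and terrible", model))
```

Working with sums of logarithms rather than products of probabilities avoids numerical underflow on long texts.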

Here are some questions which you may consider:

  • Which words are most frequent in each class of texts? Do they differ significantly among classes?

  • Analyze examples of texts which have been misclassified. Are the predicted probabilities far off in these cases?

  • Write some text samples on your own and get them classified. Does this work as expected?

  • How does classification accuracy depend on the size of the training set?

  • Anything else you find interesting.
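Two of the questions above can be approached with small helpers, sketched here under stated assumptions. The first counts the most frequent words in each class; the second turns per-class log scores (such as those a naive Bayes classifier produces) into probabilities that sum to 1, which helps when judging how far off a misclassified text was. The names top_words and log_scores_to_probs, and the sample inputs, are hypothetical.

```python
import math
import re
from collections import Counter


def top_words(texts, labels, k=3, stopwords=frozenset()):
    """Return the k most frequent non-stopword tokens for each class."""
    counts = {}
    for text, label in zip(texts, labels):
        tokens = re.findall(r"[a-z']+", text.lower())
        counts.setdefault(label, Counter()).update(
            t for t in tokens if t not in stopwords)
    return {label: c.most_common(k) for label, c in counts.items()}


def log_scores_to_probs(log_scores):
    """Normalize per-class log scores into probabilities that sum to 1."""
    # Subtracting the maximum first avoids underflow in math.exp.
    m = max(log_scores.values())
    exp = {label: math.exp(s - m) for label, s in log_scores.items()}
    z = sum(exp.values())
    return {label: e / z for label, e in exp.items()}


# Invented sample inputs, for illustration only.
print(top_words(["great great film", "boring film"], ["pos", "neg"], k=1))
print(log_scores_to_probs({"pos": -1000.0, "neg": -1001.0}))
```

To study the effect of training set size, one option is to train on increasingly large subsets of the training data and record the accuracy on a fixed test set each time.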

Note: Here is a text file with a list of stopwords that you can use to clean the texts:
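A minimal sketch of how a stopword list might be applied while tokenizing. The small STOPWORDS set below is a hypothetical stand-in for the contents of the provided file, and clean_tokens is an illustrative name; in practice you would read the words from the file into a set.

```python
import re

# Hypothetical sample; the real list would be loaded from the provided file.
STOPWORDS = {"the", "a", "is", "and", "of", "this"}


def clean_tokens(text, stopwords=STOPWORDS):
    """Lowercase the text, split it into words and drop stopwords."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return [t for t in tokens if t not in stopwords]


print(clean_tokens("This is the BEST movie of the year!"))
# prints ['best', 'movie', 'year']
```

Storing the stopwords in a set rather than a list keeps each membership check fast, which matters when processing tens of thousands of texts.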

Note: For this project you must not use sklearn or other machine learning libraries that have the naive Bayes classifier and text processing tools already implemented (you may use sklearn to split the data into training and test dataframes). You can use standard Python libraries (e.g. re and collections), pandas, numpy, and graphics libraries (matplotlib, seaborn, plotly, etc.). The goal is to process the data and implement the naive Bayes classifier from scratch using these tools.