Sentiment Analysis - Amazon Reviews

This project was completed as part of the MComp Web Design and Development course at Edge Hill University.

Scenario

The scenario for this project was that a firm planned to invest in one of three companies that make apps distributed via Amazon. To identify the best candidate, the firm wanted to use machine learning to analyse the sentiment of a dataset of Amazon product reviews. Each of the three companies had three apps. The dataset consisted of 20,000 reviews for training the machine learning algorithms and 20,000 reviews for testing the algorithms' predictions; each row in the dataset consisted of a product ID, the sentiment score and the text of the product review. My task was to identify the best algorithm for the job, implement it, and identify the best company for investment according to that algorithm's predictions.

Summary

Six machine learning algorithms were evaluated: Linear Support Vector Machine (Linear SVM), Feed Forward Neural Network, Naïve Bayes, Random Forest, ID3 Decision Tree and K-Nearest Neighbours (K-NN). The modifier values (hyperparameters) for each algorithm were tuned by trial and error, giving a total of 72 experiments across the six algorithms and their various modifiers. The performance of the algorithms was evaluated using the f-score, precision and recall. Once the evaluation was complete, company scores were calculated from the best algorithm's predictions, and the highest-scoring company was identified as the best for investment.
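As a reminder of how the three evaluation measures relate, the following is a minimal sketch that computes precision, recall and f-score for a single sentiment class in plain Python. The label strings, the example data and the function name per_class_scores are assumptions made for illustration; the project itself does not state how the scores were averaged across classes, so no averaging is shown here.

```python
def per_class_scores(true_labels, predicted_labels, label):
    """Return (precision, recall, f_score) for one class label."""
    pairs = list(zip(true_labels, predicted_labels))
    # True positives: the class was predicted and was correct.
    tp = sum(1 for t, p in pairs if t == label and p == label)
    # False positives: the class was predicted but the truth was different.
    fp = sum(1 for t, p in pairs if t != label and p == label)
    # False negatives: the class was the truth but something else was predicted.
    fn = sum(1 for t, p in pairs if t == label and p != label)

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # The f-score is the harmonic mean of precision and recall.
    if precision + recall:
        f_score = 2 * precision * recall / (precision + recall)
    else:
        f_score = 0.0
    return precision, recall, f_score


# Made-up labels, not taken from the project dataset.
true_labels = ["positive", "positive", "negative", "neutral"]
predicted_labels = ["positive", "negative", "negative", "neutral"]

positive_scores = per_class_scores(true_labels, predicted_labels, "positive")
print(positive_scores)
```

In this made-up example, one of the two positive reviews is missed, so recall for the positive class is lower than precision.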

Implementation

The implementation was completed using Python's scikit-learn package. Each word of each review was used as a feature (bag of words), and the resulting feature matrix for the training dataset was used to train each algorithm. Each algorithm was then run with its various modifier values, with K-NN used as the baseline against which all other results were compared. All predictions were then collected into a single table and the results were sorted. Graphs showing the performance of each algorithm were created and stored automatically. The best predictions for each algorithm were stored in CSV files for later use. A LaTeX-formatted table was also generated automatically for insertion into the report. Once the algorithm evaluation was complete, the company analysis began.
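The bag-of-words step and the training of a single classifier can be sketched roughly as follows. This is an illustrative outline under stated assumptions, not the project's actual script: the review texts and label strings are made up, and the use of CountVectorizer and LinearSVC is an assumption about which scikit-learn classes fit the description. Only the C value of 0.03 comes from the reported results.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

# Tiny made-up reviews standing in for the 20,000-row training and test sets.
train_texts = [
    "great app works perfectly",
    "love this game great fun",
    "terrible app crashes constantly",
    "awful waste of money",
    "it is okay nothing special",
    "average app does the job",
]
# Assumed label encoding; the real dataset stores a sentiment score per row.
train_labels = ["positive", "positive", "negative", "negative", "neutral", "neutral"]
test_texts = ["great fun love it", "crashes constantly awful"]

# Bag of words: every distinct word becomes one column of counts.
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_texts)
# The test set reuses the vocabulary learned from the training set.
X_test = vectorizer.transform(test_texts)

# Linear SVM with the C value reported as best in the results.
classifier = LinearSVC(C=0.03)
classifier.fit(X_train, train_labels)
predictions = classifier.predict(X_test)

print(X_train.shape)
print(list(predictions))
```

Fitting the vectoriser on the training reviews only, and then transforming the test reviews with it, keeps the feature columns identical for both matrices. Swapping LinearSVC for another scikit-learn estimator would leave the rest of the sketch unchanged.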

The company analysis was completed by counting the number of predicted neutral, positive and negative reviews across all applications belonging to each company. Positive and negative review counts were each given twice the weight of neutral reviews, with the negative count subtracted from the total. The score was calculated as:

Score = (Neutral + (Positive * 2) - (Negative * 2)) / 2

The best company for investment was the company with the highest score.
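The scoring formula and the final selection can be sketched as below. The company names, the label strings and the predicted sentiments are made-up placeholders for illustration, and company_score is a hypothetical helper name; only the formula itself is taken from the description above.

```python
from collections import Counter


def company_score(predicted_sentiments):
    """Score = (Neutral + (Positive * 2) - (Negative * 2)) / 2."""
    counts = Counter(predicted_sentiments)
    return (
        counts["neutral"]
        + counts["positive"] * 2
        - counts["negative"] * 2
    ) / 2


# Made-up predictions pooled across all apps of each hypothetical company.
predictions_by_company = {
    "Company A": ["positive", "negative", "neutral", "negative"],
    "Company B": ["positive", "positive", "neutral", "negative"],
    "Company C": ["neutral", "neutral", "negative", "positive"],
}

scores = {
    company: company_score(predictions)
    for company, predictions in predictions_by_company.items()
}
# The best company is the one with the highest score.
best_company = max(scores, key=scores.get)

print(scores)
print(best_company)
```

Because the score is a raw count rather than a proportion, a company with more reviews can score higher simply through volume; dividing by the total number of reviews would be one possible variation, though it is not part of the formula used here.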

Results

The best algorithm (highest f-score) for the sentiment analysis of the reviews was Linear SVM (C value = 0.03), with an f-score of 0.779. Using its predictions, the best company for investment was identified as Company 2. The algorithm evaluation was written so that a single run of the script executes all 72 experiments in turn and creates the graphs and other output files without any intervention.