This project was completed as part of the MRes Technology (Computer Science) course at the University of Portsmouth.
This project automatically labelled Iranian state-sponsored propaganda tweets by sentiment and evaluated how accurately five supervised machine learning algorithms classified the sentiment contained in the tweets.
This project used tweets from Twitter's election integrity datasets. These datasets consist of tweets from accounts that were permanently suspended from the Twitter platform after being identified as belonging to state-sponsored actors and pushing narratives on their behalf.
This project focused on tweets about the Iranian nuclear deal (JCPOA), an internationally controversial issue that Iran would likely be spreading propaganda about. To extract tweets about the nuclear deal, several keyterms were used (see here).
These keyterms were chosen after analysing the most frequent terms per month across the three Iranian releases available at the time. Additionally, the extracted tweets were restricted to those written in English and published between August 2013 and December 2018. Retweets were excluded to prevent the machine learning algorithms from developing a bias towards tweets that appear often as retweets.
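The extraction criteria above can be sketched as a single filter function. This is a minimal illustration, not the project's actual code: the field names (`tweet_text`, `tweet_language`, `tweet_time`, `is_retweet`), the timestamp format, and the three entries in `KEY_TERMS` are assumptions made for the example, and the real keyterm list is the one referenced above.

```python
from datetime import datetime

# Placeholder keyterms for illustration only; not the project's actual list.
KEY_TERMS = ("jcpoa", "nuclear deal", "iran deal")

# Date window described in the text: August 2013 to December 2018.
START = datetime(2013, 8, 1)
END = datetime(2018, 12, 31, 23, 59, 59)


def keep_tweet(tweet):
    """Return True if a tweet record passes all extraction criteria.

    `tweet` is assumed to be a dict with hypothetical keys:
    tweet_text (str), tweet_language (str), tweet_time (str,
    "YYYY-MM-DD HH:MM") and is_retweet (bool).
    """
    text = tweet["tweet_text"].lower()
    posted = datetime.strptime(tweet["tweet_time"], "%Y-%m-%d %H:%M")
    return (
        tweet["tweet_language"] == "en"       # English only
        and not tweet["is_retweet"]           # retweets excluded
        and START <= posted <= END            # inside the date window
        and any(term in text for term in KEY_TERMS)
    )


sample = {
    "tweet_text": "The JCPOA talks resume today",
    "tweet_language": "en",
    "tweet_time": "2015-07-14 09:30",
    "is_retweet": False,
}
print(keep_tweet(sample))
```

If the data is read from CSV, the retweet flag would likely arrive as a string and need converting to a boolean before this check.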
To prepare the tweets for labelling, the text was transformed through several preprocessing steps: lowercase conversion, normalising accented letters, removing usernames, transforming hashtags, removing URLs, expanding contractions, removing special characters and removing stopwords. These steps were taken to improve the chance of matching words between the lexicon and the tweets, and to speed up processing.
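The preprocessing steps listed above could look roughly like the following, using only the Python standard library. This is a sketch under assumptions: the contraction map and stopword set are tiny stand-ins, "transforming hashtags" is taken to mean dropping the `#` and keeping the word, and the order of the steps is a guess chosen so that URLs and usernames are removed before punctuation is stripped.

```python
import re
import unicodedata

# Tiny illustrative tables; a real pipeline would use much fuller lists.
CONTRACTIONS = {"can't": "cannot", "won't": "will not",
                "don't": "do not", "it's": "it is"}
STOPWORDS = {"the", "a", "an", "is", "it", "to", "of", "and", "on", "in"}


def preprocess(text):
    """Clean one tweet and return its remaining tokens."""
    text = text.lower()
    # Normalise accented letters: decompose, then drop non-ASCII marks.
    text = unicodedata.normalize("NFKD", text)
    text = text.encode("ascii", "ignore").decode("ascii")
    text = re.sub(r"https?://\S+", " ", text)   # remove URLs
    text = re.sub(r"@\w+", " ", text)           # remove usernames
    text = re.sub(r"#(\w+)", r"\1", text)       # assumed: keep hashtag word
    for short, full in CONTRACTIONS.items():    # expand contractions
        text = text.replace(short, full)
    text = re.sub(r"[^a-z\s]", " ", text)       # remove special characters
    return [tok for tok in text.split() if tok not in STOPWORDS]


tokens = preprocess("@user The #IranDeal can't fail! https://t.co/abc café")
print(tokens)  # ['irandeal', 'cannot', 'fail', 'cafe']
```

One limitation of this sketch is that the ASCII step discards typographic apostrophes, so contractions written with a curly apostrophe would not be expanded unless they are normalised first.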
To label the tweets for their sentiment, the SentiWordNet Lexicon was used. This lexicon provides a score between 0 and 1 for both positive and negative sentiments. The lexicon also provides a score between 0 and 1 for objectivity. The sentiment of the tweet was determined by averaging the positive and negative sentiment scores, and identifying which score was higher. If the scores were equal, the tweet was considered neutral.
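The labelling rule described above can be illustrated with a toy lexicon. The scores below are invented for the example and are not real SentiWordNet values; the sketch also assumes that the average is taken over the tweet's words found in the lexicon and that a tweet with no matched words is labelled neutral, neither of which is stated explicitly in the text.

```python
# Hypothetical lexicon: word -> (positive score, negative score), each in [0, 1].
# These values are made up for illustration, not taken from SentiWordNet.
TOY_LEXICON = {
    "good": (0.75, 0.0),
    "threat": (0.0, 0.5),
    "deal": (0.125, 0.125),
}


def label_sentiment(tokens, lexicon=TOY_LEXICON):
    """Label a tokenised tweet as positive, negative or neutral."""
    matched = [lexicon[tok] for tok in tokens if tok in lexicon]
    if not matched:
        # Assumption: no lexicon matches means no evidence either way.
        return "neutral"
    avg_pos = sum(pos for pos, _ in matched) / len(matched)
    avg_neg = sum(neg for _, neg in matched) / len(matched)
    if avg_pos > avg_neg:
        return "positive"
    if avg_neg > avg_pos:
        return "negative"
    return "neutral"  # equal averages


print(label_sentiment(["good", "deal"]))    # positive
print(label_sentiment(["threat", "deal"]))  # negative
print(label_sentiment(["deal"]))            # neutral
```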
The sentiment classification experiments used a metric called Match Percentage Threshold (MPT). This metric represents the number of matches between the tweet and the lexicon as a percentage, with a higher percentage indicating a higher number of matched words. The experiments were performed across five feature sets: unigrams, bigrams, trigrams, unigrams + bigrams, and unigrams + bigrams + trigrams. The five algorithms used for these experiments were: K-Nearest Neighbours, Decision Tree, Naive Bayes, Support Vector Machine (with a linear kernel), and Random Forest.
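As a rough illustration of the match percentage and the five feature sets, the sketch below computes the share of a tweet's tokens that appear in the lexicon and builds word n-grams for each feature set. It rests on assumptions: the denominator is taken to be the number of tokens left after preprocessing, and the function and dictionary names are hypothetical rather than the project's own.

```python
def match_percentage(tokens, lexicon_words):
    """Percentage of a tweet's tokens that are found in the lexicon."""
    if not tokens:
        return 0.0
    matched = sum(1 for tok in tokens if tok in lexicon_words)
    return 100.0 * matched / len(tokens)


def ngrams(tokens, n):
    """Return the word n-grams of a token list as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# The five feature sets from the text, expressed as the n values they combine.
FEATURE_SETS = {
    "unigrams": (1,),
    "bigrams": (2,),
    "trigrams": (3,),
    "unigrams+bigrams": (1, 2),
    "unigrams+bigrams+trigrams": (1, 2, 3),
}


def extract_features(tokens, feature_set):
    """Collect the n-grams belonging to one named feature set."""
    features = []
    for n in FEATURE_SETS[feature_set]:
        features.extend(ngrams(tokens, n))
    return features


tokens = ["iran", "nuclear", "deal", "bad"]
print(match_percentage(tokens, {"deal", "bad"}))      # 50.0
print(extract_features(tokens, "unigrams+bigrams"))
```

Under this reading, a threshold would be applied by keeping only tweets whose match percentage is at or above a chosen value before training. In a library such as scikit-learn, the same feature sets could be expressed through the `ngram_range` parameter of `CountVectorizer`, for example `(1, 2)` for unigrams + bigrams.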
Not yet available.