Unit 9 - Sentiment analysis - Python Text Analysis

Sentiment analysis is a computational method to extract ‘sentiment’ from text. What ‘sentiment’ means varies, but in general, it is the positive or negative opinion about something that is conveyed in the text.

The most basic methods use a dictionary of positive and negative words and count how many of those words appear in the text in question. More elaborate calculations are possible, but the difference between the positive and negative counts, or their relative proportion, is a common score.
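To make the dictionary idea concrete, here is a minimal sketch of a word-counting scorer. The word lists and the function name `simple_sentiment` are made up for illustration; real sentiment dictionaries contain thousands of entries, often with weights.

```python
import re

# Tiny, made-up word lists for illustration only; real lexicons are far larger.
POSITIVE_WORDS = {"good", "great", "like", "love", "funny"}
NEGATIVE_WORDS = {"bad", "horrible", "terrible", "hate", "boring"}


def simple_sentiment(text):
    """Count dictionary hits in a text and return two common summary scores."""
    # lowercase the text and keep only alphabetic tokens (and apostrophes)
    tokens = re.findall(r"[a-z']+", text.lower())
    pos = sum(token in POSITIVE_WORDS for token in tokens)
    neg = sum(token in NEGATIVE_WORDS for token in tokens)
    hits = pos + neg
    return {
        "positive": pos,
        "negative": neg,
        # difference between the two counts
        "difference": pos - neg,
        # difference divided by the number of matched words, from -1 to +1
        "relative": (pos - neg) / hits if hits else 0.0,
    }


print(simple_sentiment("I had a horrible, terrible, no good day"))
```

Note that this sketch counts "good" as positive even though the sentence says "no good". Handling cases like that is exactly what the more elaborate methods add.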

Sentiment analysis is widely used on everything from online posts and reviews to survey data and news. Using it without fully understanding what the numbers mean can cause many problems. Do check the reading on Canvas to explore this further:

Taboada, M. (2016) Sentiment analysis: An overview from linguistics. Annual Review of Linguistics 2: 325-347.

Sentiment analysis with VADER

VADER is a lexicon-based system for sentiment analysis. It takes a text (or post, headline, etc.) and provides two kinds of scores:

Proportion of the text that is positive, negative, or neutral (‘pos’, ‘neg’, and ‘neu’ in the output)
- These are ratios based on the words in the text that are also in the dictionary; the three values add up to roughly 1
Composite score (‘compound’ in the output)
- Calculated using the dictionary, plus a set of rules to deal with negation and other linguistic phenomena that may change the score, and then normalized to fall between -1 (most negative) and +1 (most positive)

There is an implementation of VADER in NLTK, so we will use that version.
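The composite score (reported under the key ‘compound’) always falls between -1 and +1, because the summed word scores are squashed in a final normalization step. The sketch below shows that step as I understand it from the reference implementation: the formula and the constant `alpha=15` are assumptions to check against the NLTK source, and the input totals are arbitrary numbers, not real VADER sums.

```python
import math


def normalize(score, alpha=15):
    """Squash an unbounded sum of word scores into the range (-1, 1).

    Assumption: mirrors the normalization used by VADER; verify the
    formula and the alpha constant against the NLTK source code.
    """
    return score / math.sqrt(score * score + alpha)


# arbitrary totals: larger sums approach -1 or +1 but never pass them
for total in (0.0, 1.9, 5.0, -5.0, 50.0):
    print(total, round(normalize(total), 4))
```

One consequence of this design is that long texts with many sentiment words quickly end up close to -1 or +1, which is worth keeping in mind when comparing texts of different lengths.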

Import statements

import os
import pandas
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

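# download the VADER lexicon (only needed the first time you run this)
nltk.download("vader_lexicon")

# create an instance of the analyzer
analyzer = SentimentIntensityAnalyzer()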

Try with toy sentences

We start with a few toy sentences and then try a list of examples from the VADER repository.

sent1 = "I really like you."

print(analyzer.polarity_scores(sent1))

sent2 = "I really really really very much like you."

print(analyzer.polarity_scores(sent2))

sent3 = "I had a horrible, terrible, no good day"

print(analyzer.polarity_scores(sent3))

sent4 = "I haven't read a better book"

print(analyzer.polarity_scores(sent4))

# these are sample sentences from VADER
sentences = ["VADER is smart, handsome, and funny.",  # positive sentence example
             "VADER is smart, handsome, and funny!",  # punctuation emphasis handled correctly (sentiment intensity adjusted)
             "VADER is very smart, handsome, and funny.", # booster words handled correctly (sentiment intensity adjusted)
             "VADER is VERY SMART, handsome, and FUNNY.",  # emphasis for ALLCAPS handled
             "VADER is VERY SMART, handsome, and FUNNY!!!", # combination of signals - VADER appropriately adjusts intensity
             "VADER is VERY SMART, uber handsome, and FRIGGIN FUNNY!!!", # booster words & punctuation make this close to ceiling for score
             "VADER is not smart, handsome, nor funny.",  # negation sentence example
             "The book was good.",  # positive sentence
             "At least it isn't a horrible book.",  # negated negative sentence with contraction
             "The book was only kind of good.", # qualified positive sentence is handled correctly (intensity adjusted)
             "The plot was good, but the characters are uncompelling and the dialog is not great.", # mixed negation sentence
             "Today SUX!",  # negative slang with capitalization emphasis
             "Today only kinda sux! But I'll get by, lol", # mixed sentiment example with slang and contrastive conjunction "but"
             "Make sure you :) or :D today!",  # emoticons handled
             "Catch utf-8 emoji such as such as 💘 and 💋 and 😁",  # emojis handled
             "Not bad at all"  # Capitalized negation
             ]

for sentence in sentences:
    vs = analyzer.polarity_scores(sentence)
    print("{:-<60} {}".format(sentence, str(vs)))

# another way to print
for sentence in sentences:
    scores = analyzer.polarity_scores(sentence)
    print(sentence)
    print("   ", scores)
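A common next step is to turn the ‘compound’ value into a label. The sketch below uses the cut-offs of 0.05 and -0.05 that the VADER documentation suggests as typical; the function name `label_from_compound` is hypothetical, and the example dictionaries are hand-written in the shape that `polarity_scores()` returns, with invented values rather than real VADER output.

```python
def label_from_compound(scores, threshold=0.05):
    """Map a VADER-style score dictionary to a coarse label."""
    compound = scores["compound"]
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"


# invented values for illustration, NOT real VADER output
examples = [
    {"neg": 0.0, "neu": 0.5, "pos": 0.5, "compound": 0.6},
    {"neg": 0.6, "neu": 0.4, "pos": 0.0, "compound": -0.7},
    {"neg": 0.0, "neu": 1.0, "pos": 0.0, "compound": 0.0},
]

for example in examples:
    print(example["compound"], label_from_compound(example))
```

With the real analyzer, you would call it as `label_from_compound(analyzer.polarity_scores(sentence))`. The threshold is a convention, not a fixed rule, so adjust it to your data.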

Movie reviews

Next, we will analyze some movie reviews from the SFU Review Corpus. They are in the ‘reviews’ directory.

We will follow the usual process: one function calculates the scores, and another function processes the directory and calls the first function on each file. Note that I am creating the analyzer again inside the function. This is so that you can run the function by itself, without the statement above.

def get_sentiment_scores(text):
    """
    Uses VADER within NLTK to calculate sentiment
    
    Args:
        text (str): a string containing the text to analyze

    Returns:
        dict: the VADER scores, with the keys 'neg', 'neu', 'pos', and 'compound'
    """
    analyzer = SentimentIntensityAnalyzer()
    score = analyzer.polarity_scores(text)
    return score
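Creating a new analyzer on every call keeps the function self-contained, but it also means the lexicon is loaded again for each file. If that becomes slow with a large directory, one alternative is to make the analyzer an optional argument. This is only a sketch of that design choice; the name `get_sentiment_scores_shared` and the `analyzer` parameter are hypothetical and not part of NLTK.

```python
def get_sentiment_scores_shared(text, analyzer=None):
    """
    Calculates sentiment with VADER, reusing an analyzer if one is given

    Args:
        text (str): a string containing the text to analyze
        analyzer: an object with a polarity_scores() method (optional)

    Returns:
        dict: the dictionary of scores that the analyzer creates
    """
    if analyzer is None:
        # imported here so that the function still runs by itself
        from nltk.sentiment.vader import SentimentIntensityAnalyzer
        analyzer = SentimentIntensityAnalyzer()
    return analyzer.polarity_scores(text)

# typical use: create the analyzer once, then pass it in for every text
# shared = SentimentIntensityAnalyzer()
# scores = get_sentiment_scores_shared("I really like you.", analyzer=shared)
```

For a handful of reviews the difference is small, so the simpler version above is perfectly fine for this unit.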

def process_dir(path):
    """
    Reads all the files in a directory. Processes them using the 'get_sentiment_scores' function
    
    Args: 
        path (str): path to the directory where the files are
        
    Returns:
        dict: a dictionary with file names as keys and the VADER score dictionaries as values
    
    """
    scores = {}

    for filename in os.listdir(path):
        if filename.endswith(".txt"):    
            file_path = os.path.join(path, filename)      
            with open(file_path, 'r', encoding="utf-8") as f:
                text = f.read()
                scores[filename] = get_sentiment_scores(text)
    return scores

path = './reviews'

scores_files = process_dir(path)

print(scores_files)

# of course, pandas makes everything look better
df = pandas.DataFrame.from_dict(scores_files, orient="index")
df
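Once you have one score dictionary per file, it helps to order the files from most positive to most negative before reading them. The sketch below does this with plain Python on a dictionary in the same shape as `scores_files`. The file names and numbers are invented for illustration and are not results from the corpus, and `rank_by_compound` is a hypothetical helper.

```python
def rank_by_compound(scores_by_file):
    """Return (filename, scores) pairs sorted from most to least positive."""
    return sorted(
        scores_by_file.items(),
        key=lambda item: item[1]["compound"],
        reverse=True,
    )


# invented file names and values, NOT real results from the reviews
example_scores = {
    "review_a.txt": {"neg": 0.05, "neu": 0.75, "pos": 0.20, "compound": 0.98},
    "review_b.txt": {"neg": 0.20, "neu": 0.70, "pos": 0.10, "compound": -0.91},
    "review_c.txt": {"neg": 0.10, "neu": 0.78, "pos": 0.12, "compound": 0.35},
}

for filename, scores in rank_by_compound(example_scores):
    print(f"{filename:<15} {scores['compound']:+.2f}")
```

With the DataFrame from above, `df.sort_values("compound", ascending=False)` gives the same ordering.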

Summary

We have learned about sentiment analysis and how to run a ‘classic’ method for sentiment, VADER.

Read the files and compare the scores to the text in the files. Do you agree with what VADER says? Why or why not?

References

Taboada, M. (2016). Sentiment Analysis: An Overview from Linguistics. Annual Review of Linguistics, 2(1), 325–347. https://doi.org/10.1146/annurev-linguistics-011415-040518