
Unit 8 - Generative AI

Generative Artificial Intelligence refers to a class of methods and algorithms that use Large Language Models to generate text. The most common way we experience that generative capacity is with models like ChatGPT, where we give it an input and it generates an answer.

There are tons of good explanations out there for these models. Terms that you should understand are:

  • Artificial intelligence (AI) - an attempt to model human intelligence

  • Machine learning (ML) - learning to identify patterns in data, usually to identify (predict) new patterns in new data

  • Neural machine learning, neural models - a class of ML models that take as inspiration the structure of the human brain (neurons and connections)

  • Deep learning - a class of ML models that use neural networks with several layers of ‘neurons’ (which is what makes them ‘deep’)

  • Large language models (LLMs) - models of language based on the principle of sequences: You can predict the next word/token based on the previous word(s).

  • Generative AI - models that generate text by predicting the next word, built on LLMs

We are dealing with language in this course, which is why I talk about predicting the next word or token. Generative AI also works for images, video, code, and other types of data.

The basic principle behind most modern Generative AI models is a classic saying from linguist J.R. Firth (1890-1960): “you shall know a word by the company it keeps”. Originally, this meant that you can understand the meaning of a word by looking at the sentences or context it appears in. In modern NLP and LLMs, it means that we can create a model of language based on collocation, the way that words occur together.

A language model is just the encoding of knowledge about likely sequences and word associations. You can create a language model by obtaining statistics about a large corpus of language. You can also create a large language model (LLM) like the ones behind ChatGPT or Llama using deep learning.
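
As a tiny preview of that idea (a toy sketch in plain Python, not the NLTK workflow we use below), you can already collect such statistics by counting adjacent word pairs:

from collections import Counter

# a (very small) toy "corpus"
corpus = "I really like you and I really like this song".split()

# count how often each pair of adjacent words occurs
bigram_counts = Counter(zip(corpus, corpus[1:]))

print(bigram_counts[('really', 'like')])   # 2
print(bigram_counts[('like', 'you')])      # 1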

To understand this, we are going to work with ngrams (also spelled n-grams). Ngrams are sequences of tokens (usually words, but sometimes also punctuation) of size n. Terms:

  • unigram / 1-gram - one token/word

  • bigram / 2-gram - sequence of two tokens

  • trigram / 3-gram - sequence of three tokens

  • 4-gram - sequence of four tokens

  • etc.

We’ll create simple language models from ngrams using NLTK. The intuition behind n-grams is that you can predict the next item in a sequence if you know, from corpora, how frequently sequences of n items occur. To make things simple, let’s think of n as 2. And we will assume each item is a word. But you can also calculate ngram frequencies for characters, sounds, or sentences.

Thus, we can figure out what the next word is if we know what the previous words are. Let’s say that we want to find out the likelihood that the next word in the sequence I really like is you. This is, by the way, what Google suggested when I typed I really like... The first link was to a Carly Rae Jepsen song. We can calculate that as:

$$P(\text{you} \mid \text{I, really, like})$$

The formula as written involves a 4-gram (a sequence of 4 words). This can be difficult to calculate, especially for less frequent combinations of words. So, to make this into a bigram probability, we calculate the following, which reads as “the probability of you given like”:

$$P(\text{you} \mid \text{like})$$

The general formula is below. The probability of $w_i$ given the sequence $w_1$ to $w_{i-1}$ is approximately the probability of $w_i$ given $w_{i-1}$. So, instead of calculating probabilities for a long sequence of words, we do it for a sequence of 2 words at a time.

$$P(w_i \mid w_1, w_2, w_3, ..., w_{i-1}) \approx P(w_i \mid w_{i-1})$$

Note that above we say “the probability of x given y”. To estimate that, we just count how often the two words appear together in a large enough corpus, relative to how often the first word appears on its own. This is what we’ll do in this notebook!
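
To make that counting idea concrete, the standard estimate (the maximum likelihood estimate, spelled out here as an extra step; the toy numbers below are illustrative, not Reuters counts) is:

$$P(w_i \mid w_{i-1}) = \frac{C(w_{i-1}, w_i)}{C(w_{i-1})}$$

where $C(\cdot)$ is a count over the corpus. For example, if like occurs 1,000 times and like you occurs 50 of those times, then $P(\text{you} \mid \text{like}) = 50/1000 = 0.05$.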

Credits: NLTK LM documentation, N-gram language models, N-gram language modelling with NLTK, Predicting next word using n-gram model NLTK.

Calculating ngrams

Import statements

We import everything we need, including bits of NLTK. To train only on “important” or content words, we will remove punctuation and stopwords. We’ll use the NLTK Reuters corpus to train, so we need to import that too.

import string
import nltk 
from nltk.corpus import stopwords
from nltk.util import bigrams
from nltk.util import ngrams
from nltk.util import everygrams
from nltk.corpus import reuters 
from nltk import FreqDist 
from nltk import word_tokenize, sent_tokenize 
nltk.download('punkt') 
nltk.download('stopwords') 
nltk.download('reuters') 

NLTK ngram functions

There are several functions in NLTK that you can use. We start with nltk.bigrams(). It takes a list of tokens as input and gives you all the possible bigrams of words. Then we can compute the frequency distribution of those bigrams.

But the best function is everygrams(), which builds as many ngrams as you like from an input. You give it word tokens (but it can also be used with character tokens) and tell it the minimum and maximum size of ngram to build. In my example, I say 1, 3, which means: give me unigrams, bigrams, and trigrams.

sent1 = "I really like you."

sent1_tokens = nltk.word_tokenize(sent1)

sent1_bi = nltk.bigrams(sent1_tokens)

# compute frequency distribution for all the bigrams in the text
sent1_fdist = nltk.FreqDist(sent1_bi)

for key, value in sent1_fdist.items():
    print(key, value)
# same, with another sentence
# note that "really really" now has a frequency of 2

sent2 = "I really really really very much like you."

sent2_tokens = nltk.word_tokenize(sent2)

sent2_bi = nltk.bigrams(sent2_tokens)

sent2_fdist = nltk.FreqDist(sent2_bi)

for k, v in sent2_fdist.items():
    print(k, v)
# now, everygrams for sent1

sent1_every = everygrams(sent1_tokens, 1, 3)
list(sent1_every)
# build everygrams for sent2
# you can also build them of different length (1-2, 1-3, 1-4, etc)
# and you can try your own sentences

Calculating ngram frequencies from Reuters

Ngrams are really useful when we have large numbers of them and their frequencies. In this part, we take all the sentences in the Reuters corpus and count the frequencies of their ngrams. First, we create a removal_list with all the things that we want to strip (punctuation and stopwords). Then, we create the lists of unigrams, bigrams, and trigrams, padding to the left and to the right. Padding just means adding a special “word” that indicates the beginning and end of a sentence, so that the first and last words also participate in all possible bigrams.

For instance, in I really like you, we could have the following bigrams:

I, really
really, like
like, you

But notice how, unlike the other words, I and you only participate in one bigram each. We want to record that this is because they are the beginning and end of the sentence. Padding adds that information, which here I am representing with the tags <s> and </s>. So then we create the following bigrams:

<s>, I
I, really
really, like
like, you
you, </s>
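
Here is a minimal sketch of how to get those padded bigrams with NLTK: nltk.bigrams() accepts padding arguments, and the symbols <s> and </s> are just labels we choose.

tokens = nltk.word_tokenize("I really like you")

# pad_left/pad_right add boundary symbols so that the first and last
# words also take part in a bigram each
padded = list(bigrams(tokens, pad_left=True, pad_right=True,
                      left_pad_symbol='<s>', right_pad_symbol='</s>'))
print(padded)
# [('<s>', 'I'), ('I', 'really'), ('really', 'like'), ('like', 'you'), ('you', '</s>')]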

So, we will first create a set of removal words, punctuation and stopwords that we don’t want to include in the lists of ngrams. You can see what it contains below.

Next, we import the Reuters sentences and use everygrams to create unigrams, bigrams, and trigrams. We remove those that have words in the removal list.

After that, word_salad is a dictionary with the frequency distribution of those ngrams.

Finally, we use word_salad to create a sequence of segments that start with a certain prompt. The segments are made up of the prompt, plus the most likely next word. Thus, if the prompt is “it will”, then we’ll get the following:

('it', 'will'), 
('it', 'will', 'be'), 
('It', 'will'), 
('it', 'will', 'pay'), 
('it', 'will', 'not'), 
('IT', 'WILL'), 
('it', 'will', 'continue'), 
('it', 'will', 'have'), 
('it', 'will', 'make'), 
('it', 'will', 'take'), 
('It', 'will', 'be'), 
('it', 'will', 'raise'), 
('it', 'will', 'acquire'), 
('it', 'will', 'also'), 
('it', 'will', 'report'), 
('it', 'will', 'offer'), 
('it', 'will', 'issue'), 
('it', 'will', 'receive'), 
('it', 'will', 'increase'), 
('it', 'will', 'sell'),
 etc.
# create the list of things we'll remove (some punctuation, stopwords, and end of line codes)

stop_words = set(stopwords.words('english'))
my_punctuation = string.punctuation + '“”‘’–—'  # curly quotes and dashes are not in string.punctuation
removal_list = set(stop_words) | set(my_punctuation) | {'lt','rt', '\n'}
len(removal_list)
# just check what's in that list
removal_list

Import data to create ngrams

Here, we will use the Reuters corpus. After you import it into sents, check its contents. You’ll see that it’s a list of sentences, each of which is in turn a list of words.

NOTE for lab: If you are going to do this with regular text data (text that you read from a file), you first need to tokenize it into words. You can use code like the following, given that the text you read from the file is called text_from_file.

text_tokenized = word_tokenize(text_from_file)

# import Reuters and inspect it
sents = reuters.sents()
sents

Define and use a function to clean the text

This function takes the list of sentences, each of which is in turn a list of words, and flattens them into a single list of words. This is because Reuters provides a list of sentences. For plain text, you’d skip the ‘flattened’ part of this function.

Then, call the function to clean the text and remove all the stopwords, punctuation, etc., that are in the removal_list.

def clean_flattened_sentences(sentences):
    # flatten the list of sentences into a single list of words
    flattened = [word for sent in sentences for word in sent]
    
    # clean the flattened list by removing stop words and punctuation
    cleaned_flattened = [word for word in flattened if word not in removal_list]
    
    return cleaned_flattened
# call the function
cleaned_flat_sentences = clean_flattened_sentences(sents)
# print the first 50 cleaned words, just to inspect the list
print(cleaned_flat_sentences[:50])

Run everygrams and create a dictionary

everygrams() creates a list of all the unigram, bigram, trigram combinations in the text. Then, we use FreqDist() to create a dictionary with the frequency of each of those ngrams.

Warning

one_two_three_ngrams is a generator (see what happens when I say type(one_two_three_ngrams) below), so you shouldn’t try to print it or inspect it before you run the FreqDist: a generator can only be consumed once. Just run the next two lines one after the other.

# create the ngrams
one_two_three_ngrams = everygrams(cleaned_flat_sentences, 1, 3, pad_left=True, pad_right=True)
# create a dictionary with the frequency of all the ngrams
word_salad = FreqDist(one_two_three_ngrams)
# this is just so that you know this variable is a generator
type(one_two_three_ngrams)
# inspect the word salad. It should be a dictionary
word_salad
# sort the dictionary in reverse order (the result is a list, but that's fine, as we only want to see it)
word_salad_ordered = sorted(word_salad.items(), key=lambda x:x[1], reverse=True)
# print the first 20 items
word_salad_ordered[:20]
# print the last 20 items
word_salad_ordered[-20:]
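
Because FreqDist is a subclass of Python’s Counter, word_salad also behaves like a dictionary, so you can look up individual ngrams directly. A quick sketch (the exact counts depend on your copy of the Reuters corpus, and a missing ngram simply returns 0):

# look up the frequency of specific ngrams
print(word_salad[('said',)])         # a unigram
print(word_salad[('mln', 'dlrs')])   # a bigram
# the top 10 ngrams, equivalent to the start of the sorted list above
print(word_salad.most_common(10))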

Generating text from the word salad lists

I can give the list of ngrams a prefix, or a ‘prompt’, and it will give me all the possible things that come after it. This is just an unordered list, but you can also order it by the most frequent.

# given an input "prompt"
prefix = 'company loss'

# Check what's most likely to come next
# make sure there are no None values
matching_ngrams = [
    ng for ng in word_salad 
    if all(word is not None for word in ng) and ' '.join(ng).lower().startswith(prefix.lower())
]

print(matching_ngrams)
# try a different prompt
prefix = 'they said'

matching_ngrams = [
    ng for ng in word_salad 
    if all(word is not None for word in ng) and ' '.join(ng).lower().startswith(prefix.lower())
]

print(matching_ngrams)
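
The matches above come back in no particular order. Since word_salad stores a frequency for every ngram, one way to rank them (a small sketch) is to sort the matches by their counts, so the most likely continuations come first:

# rank the matching ngrams by how often they occur in the corpus
ranked = sorted(matching_ngrams, key=lambda ng: word_salad[ng], reverse=True)

for ng in ranked[:10]:
    print(ng, word_salad[ng])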

Summary

We have learned a basic principle of generative AI: to predict the next word in a sequence based on a corpus.

Note that ngrams are not actually generative AI. LLMs represent words with vector embeddings, rather than raw token counts, and are trained with deep learning, but the basic principle of predicting the next word is the same.