
Unit 4 - Regular expressions. Functions

Regular expressions are a powerful way to describe text patterns, so that you can find or change the same kind of thing over and over again. When you do a search and replace in something like Microsoft Word, you are doing something similar, except that regular expressions let you match patterns, not just exact strings.

We’ll be using the Python re module for regular expressions. You can follow along with a Python tutorial about the re module.

But, first, we’ll start with a review of functions.

We’ll also learn how to structure notebooks (and code in general) a bit more neatly. We’ll start by importing all the modules we need, at the beginning of the notebook.

import os
import csv
import re
import nltk
from nltk.tokenize import word_tokenize
from nltk import FreqDist

Functions

We have been using functions already, but we never really went in depth on what they are. A function is really just a way to do some computation and receive a result from that computation. To do this, the function needs an input that we provide to it. So the three parts of any function are input, computation, and output. Take, for example, the open() function. We pass it a path to a file and it returns an opened file. We don’t get to see the computation, since it happens inside the function, and we don’t see any output unless we ask for it, for instance with print().

Python comes with many predefined functions. However, sometimes we need to create our own functions so that we can easily reuse code.

To create a user-defined function, we start with def, followed by the function name and the input it takes. Then we put our code below that defining line.

Note the colon and new line. The contents of the function are always indented below the colon.

You can also optionally include a statement outlining what the function does, called a docstring. This is like a comment, except it has a slightly different format. Note the 3 quotation marks and that the text appears in a different colour. This tells Python to store that text as special information about the function. You can then retrieve that information with a help(function_name) statement.

If we want our function to return some answer or result, then we need to include a return statement at the end of our function.

In the next code block I will create a function for computing lexical diversity. To review how functions work, here is a good tutorial: https://www.w3schools.com/python/python_functions.asp

# first we define our function and its input
def lexDiv(types, tokens):
    # help or description for the function. Note the format
    """
    Calculates lexical diversity

    Args:
        types (int): the number of types (unique words)
        tokens (int): the number of tokens (total words)

    Returns:
        float: the lexical diversity (types divided by tokens)
    """
    #  we make a computation, avoiding division by 0
    diversity = types/tokens if tokens > 0 else 0
    # lastly we return the result
    return diversity
# once defined, we can use, or 'call' the function
# here, we call it with 2 fixed numbers. Usually, you'd give it 2 variables (types and tokens)
print(lexDiv(4328, 12094))
# get the description or help on this function
help(lexDiv)
help(len)

Note that this function, the way it’s defined, has 2 arguments: types and tokens. If you try to call the function with only 1 argument, or with 3, you’ll get an error.

print(lexDiv(4328))
print(lexDiv(4328, 12094, 15))
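If you want to see the error message without stopping the rest of the notebook, you can catch the TypeError. A minimal sketch (the function is redefined here so the cell stands on its own):

```python
# a copy of the lexDiv function from above, so this cell is self-contained
def lexDiv(types, tokens):
    """Calculates lexical diversity."""
    return types / tokens if tokens > 0 else 0

try:
    # wrong number of arguments: 'tokens' is missing
    lexDiv(4328)
except TypeError as e:
    # Python explains which required argument was not supplied
    print(e)
```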

Defining a function, reading multiple files, and using the function on those files

We are now going to put together everything we’ve learned so far. We’ll define a function to use NLTK to calculate types and tokens, create another function to get and process all the files in a directory, and then call the function. This is a more efficient way of doing what we did for Week 3.

Function to get tokens, types, and lexical diversity

# define the function. We give it an easy to remember name
# and the argument is a variable that contains a string, 'text'
def get_text_info(text):
    """
    Uses NLTK to calculate: tokens, types, lexical diversity
    
    Args:
        text (str): a string containing the file or text
        
    Returns: 
        dict: a dictionary containing tokens, types, and lexical diversity
    """
    # call the NLTK function to tokenize and store the results in 'tokens'
    tokens = nltk.word_tokenize(text)
    # get the length of the variable 'tokens'
    n_tokens = len(tokens)
    # get the length of the types
    n_types = len(set(tokens))
    # calculate the lexical diversity
    # we can do it directly here, or call the function we created above
    lexical_diversity = n_types / n_tokens if n_tokens > 0 else 0
    # lexical_diversity = lexDiv(n_types, n_tokens)
    # we also need to tell the function what information to return
    # here, we create a dictionary to store it all
    return {
            'tokens': n_tokens,
            'types': n_types,
            'lexical_diversity': lexical_diversity
        }

Function to read and process all the files in a directory

# define the function
def process_dir(path):
    """
    Reads all the files in a directory. Processes them using the 'get_text_info' function
    
    Args: 
        path (str): path to the directory where the files are
        
    Returns:
        dict: a dictionary with file names as keys and the tokens, types, lexical diversity, as values
    
    """
    file_info = {}

    # loop through all the files in the directory "data"
    for filename in os.listdir(path):
        # check only for .txt files
        if filename.endswith(".txt"):    
            # build the full path to the file
            file_path = os.path.join(path, filename)      
            # open one file at a time, to read it, and with utf encoding
            with open(file_path, 'r', encoding="utf-8") as f:
                # store the contents of the file into the variable "text"
                text = f.read()
                # call the function on each file
                file_info[filename] = get_text_info(text)
    # return the info
    return file_info

Using the functions

Now we call the process_dir() function, which calls the get_text_info() function. This is a cleaner, more modular way of writing code.

# define the path. This directory should have more than 1 file
path = './data'

files_in_dir_info = process_dir(path)

Check the output

You can use the print() function to check the output. You can also modify the for loop below to save this information to a CSV file. Note that, because files_in_dir_info is a dictionary, we need to go through its items.

for file, info in files_in_dir_info.items():
    print(f"File: {file}")
    print(f"Tokens: {info['tokens']}")
    print(f"Types: {info['types']}")
    print(f"Lexical diversity: {info['lexical_diversity']}")
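As a sketch of the CSV idea: the loop can be turned into a csv.DictWriter call. The sample dictionary and the output name results.csv below are made up for illustration; in the notebook you would use your files_in_dir_info directly.

```python
import csv

# sample data in the same shape as the output of process_dir() (values are illustrative)
files_in_dir_info = {
    "example.txt": {"tokens": 12094, "types": 4328, "lexical_diversity": 4328 / 12094},
}

# write one row per file; 'results.csv' is an arbitrary output name
with open("results.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["file", "tokens", "types", "lexical_diversity"])
    writer.writeheader()
    for file, info in files_in_dir_info.items():
        writer.writerow({"file": file, **info})
```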

Regular expressions

You often get data that needs a bit of extra cleaning. Regular expressions are a simple yet powerful way to do that.

Note that we already imported the re module above. Now we get to use one of its functions, search(). As with all functions that come from a module in Python, you use it by typing the name of the module before the function: re.search().

re.search() takes two arguments:

  1. the pattern you are searching for

  2. the place where you are searching for it (usually a string)

re.search('e', 'beekeeper') finds the first instance of the letter ‘e’ in the word ‘beekeeper’. Try it below.

But, if you want to find not just the first, but all of the instances, then you use re.findall(). Try it below as well.

Finally, you can use re.sub() to replace text that matches a certain pattern. re.sub() takes 3 arguments: the pattern, the thing to replace it with, and the string. Try it below to replace ‘the’ with ‘a’.

There are a few useful conventions in regular expressions:

  • [] matches one of the things inside

    • [Tt] matches either upper or lower case ‘t’

  • - (inside brackets) indicates a range

    • [0-9] matches any single digit from 0 to 9

    • [a-z] matches any single lowercase letter

    • [A-Z] matches any single uppercase letter

  • * matches 0 or more of the previous character

    • o*h! matches: h!, oh!, ooh!, oooh!, ooooh!, etc

  • + matches 1 or more of the previous character

    • o+h! matches: oh!, ooh!, oooh!, ooooh!, etc

  • . matches any single character

    • beg.n matches: begin, began, begun, but also begon and beg3n. This is useful to identify patterns such as sh!t.

  • Special characters, such as ., ,, or \, need to be preceded by a backslash, so that the regular expression knows that you mean the literal punctuation mark, not its special meaning as a regular expression convention.

    • \. means “find a literal period”

  • ^ matches the beginning of the line; $ matches the end of the line

re.search('e', 'beekeeper')
re.findall('e', 'beekeeper')
sentence = "The sentence contains the word the spelled in upper and lower case."
re.findall('[Tt]he', sentence)
a_sentence = re.sub(r'[Tt]he', 'a', sentence)
print(a_sentence)
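The pattern conventions listed above can be tried the same way; a quick sketch using made-up strings:

```python
import re

# * : 0 or more of the previous character
print(re.findall(r'o*h!', 'h! oh! ooh! oooh!'))   # ['h!', 'oh!', 'ooh!', 'oooh!']

# + : 1 or more of the previous character
print(re.findall(r'o+h!', 'h! oh! ooh! oooh!'))   # ['oh!', 'ooh!', 'oooh!']

# . : any single character
print(re.findall(r'beg.n', 'begin began begun'))  # ['begin', 'began', 'begun']

# \. : a literal period
print(re.search(r'\.', 'End. Start').start())     # 3

# ^ : the beginning of the line (or string)
print(re.findall(r'^The', 'The theatre'))         # ['The']
```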

Clean up a file

The most useful aspect of regular expressions is that you can use them to get rid of stuff you don’t need in a file or series of files. Let’s say we just want to analyze the language of a script. Then, we want to remove all characters’ names and stage directions, as we are likely only interested in the dialogue.

We are going to take the Ghostbusters file we downloaded a while back (or any other script from Scificripts) and we’ll clean it up.

First, we read it into a variable and print it to the screen.

filePath = "./data/Ghostbusters.txt"

with open(filePath, "r", encoding="utf-8") as f:
    ghostbusters = f.read()
    
print(ghostbusters) 

You’ll see that most of the stuff we want to get rid of is in all caps on its own line:

  • FADE IN

  • EXT. NEW YORK PUBLIC LIBRARY -- DAY

  • LIBRARIAN

We’ll write a regular expression using re.sub() to find all lines that consist entirely of uppercase characters and replace them with nothing. Note that those uppercase-only lines may also contain spaces, periods, commas, and hyphens.

  • The first argument is: ^[A-Z\s\.\,\-]+$\n?

    • ^ indicates the beginning of the line

    • A-Z means any uppercase character

    • \s means space

    • \. means period

    • \, means comma

    • \- means hyphen

    • + means 1 or more instances of the characters inside the brackets

    • $ means the end of the line

    • \n means the end of line character

    • ? means that the \n is optional

  • The second argument is: ''

    • This means ‘replace with nothing’, as there are no characters between the quotes

  • The third argument is the variable to operate on

  • We also use the flag re.MULTILINE so that ^ and $ match the beginning and end of each line, not just of the whole string

ghostbusters_clean = re.sub(r'^[A-Z\s\.\,\-]+$\n?', '', ghostbusters, flags=re.MULTILINE)
print(ghostbusters_clean)

Note that there are still a couple of things that should be removed:

  • lines that contain a character name plus parentheses: VENKMAN (V.O.)

  • text in parentheses: (puzzled)

You can write additional regular expressions to deal with those. For the first case, you can probably just modify the regular expression above to include parentheses.
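As a sketch of the second case, re.sub() with a non-greedy pattern can strip the parenthesized text; the sample text below is made up in the style of the script:

```python
import re

# made-up lines in the style of the script
text = "VENKMAN (V.O.)\nHe slimed me.\n(puzzled)\nThat's great."

# remove anything in parentheses; .*? is non-greedy, so each pair is matched separately
no_parens = re.sub(r'\(.*?\)', '', text)

# then remove the all-caps lines, as before
cleaned = re.sub(r'^[A-Z\s\.\,\-]+$\n?', '', no_parens, flags=re.MULTILINE)
print(cleaned)
```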

Summary

In this notebook you have learned some concepts about writing and using functions:

  • Structure and arguments for functions

  • How to write the help part of functions

  • How to use a function to read and process files in a directory

You have also learned about regular expressions:

  • Main functions in re

  • Some operators for regular expressions

  • How to use regular expressions to clean up text