Unit 5 - Pandas and dataframes
Pandas is a Python library for working with and storing data in a spreadsheet-like format. Pandas allows us to store large datasets of numbers, such as the type and token counts of hundreds of texts. This notebook introduces some of the basic Pandas concepts and functions that we’ll need, but there are also great introductions out there:
https://www.learndatasci.com/tutorials/python-pandas-tutorial-complete-introduction-for-beginners/
http://www.gregreda.com/2013/10/26/intro-to-pandas-data-structures/
We will work with two basic Pandas data structures:
Series
Dataframes
A series is very similar to a list or a dictionary. For instance, we have seen lists of words (tokens) extracted from a document and dictionaries of words and their frequencies:
List:
['If', 'you', 'use', '``', 'bad', "''", 'to', 'mean', '``', 'good', "''", ',', 'then', ...]
Dictionary:
{',': 31, 'the': 24, '.': 21, 'a': 14, 'and': 14, 'to': 13, "'s": 6, 'The': 6, 'I': 6, 'is': 5, ...}
In a series, you’ll get that dictionary as a vertical list:
, 31
the 24
. 21
a 14
and 14
Then, with Pandas, you can turn those series into dataframes. Dataframes are like spreadsheets. So, from the dictionary above, we can create a dataframe that looks sort of like this:
| Token | Count |
|---|---|
| , | 31 |
| the | 24 |
| . | 21 |
| a | 14 |
| and | 14 |
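We will build our dataframes from dictionaries of file information below, but as a quick illustration, here is a minimal sketch (using a shortened version of the token counts above) of how a dictionary becomes a series and then a dataframe:

import pandas as pd

counts = {',': 31, 'the': 24, '.': 21, 'a': 14, 'and': 14}
s = pd.Series(counts)                  # the vertical token-to-count list shown above
df_counts = s.to_frame(name='Count')   # a dataframe with a single 'Count' column
df_counts.index.name = 'Token'         # label the index so it matches the table above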
We’ll start with the same process we’ve done before, where we had files and their token, type, sentence, and lexical diversity counts. We’ll convert those lists and dictionaries into series and then put them into a dataframe. Finally, to save the information, we’ll use a Pandas function to save that dataframe into a csv file. A comma-separated values (csv) file is an easy way to store table information in a text-only format.
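To make that concrete, the small token table above, saved as a csv, would be plain text like the following (illustrative only; note that the comma token itself has to be quoted so it is not read as a separator):

Token,Count
",",31
the,24
.,21
a,14
and,14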
We start by importing all the packages we need. Note that we import Pandas as “pd”. That allows us to type the Pandas functions with a shorthand, “pd”, rather than the full “pandas”.
import os
import nltk
import numpy
import matplotlib
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk import FreqDist
import pandas as pd
Function definitions
We’ll reuse the functions from Week 4, with some modifications. The first one, get_text_info(), now only calculates tokens and types. We’ll use the pandas dataframe to calculate the lexical diversity.
The second function is the same. It reads all the files in the directory and calculates the information from get_text_info.
def get_text_info(text):
    """
    Uses NLTK to calculate tokens and types
    Args:
        text (str): a string containing the file or text
    Returns:
        dict: a dictionary containing tokens and types
    """
    tokens = nltk.word_tokenize(text)
    n_tokens = len(tokens)
    n_types = len(set(tokens))
    return {
        'tokens': n_tokens,
        'types': n_types,
    }
def process_dir(path):
    """
    Reads all the files in a directory. Processes them using the 'get_text_info' function
    Args:
        path (str): path to the directory where the files are
    Returns:
        dict: a dictionary with file names as keys and their tokens and types as values
    """
    file_info = {}
    for filename in os.listdir(path):
        if filename.endswith(".txt"):
            file_path = os.path.join(path, filename)
            with open(file_path, 'r', encoding="utf-8") as f:
                text = f.read()
                file_info[filename] = get_text_info(text)
    return file_info
Call the functions
We call the function process_dir(), which in turn calls get_text_info() and returns a dictionary with the name of each file as key, and its n_tokens and n_types as values.
# define the path. This directory should have more than 1 file
path = './data'
files_in_dir_info = process_dir(path)
files_in_dir_info
Dictionaries to pandas dataframe
Now, instead of printing the information in files_in_dir_info, we convert that information, which is stored as a dictionary, into a dataframe, essentially a table. The information looks like this (see the output above):
{'Ghostbusters.txt': {'tokens': 29349,
                      'types': 4862},
 'middlemarch.txt': {'tokens': 374039,
                     'types': 20420},
 'noise.txt': {'tokens': 114158,
               'types': 8603}}
So you see how it’s already structured like a table. You can think of the names of the files as headings, and then each file has a row for ‘tokens’ and ‘types’.
We use the DataFrame.from_dict() function in pandas, giving it the dictionary so that it transforms it into a dataframe. The orient='index' flag makes each key in the dictionary the name of a row. So the file names are the rows.
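As a minimal sketch (with a made-up two-file dictionary), this is the difference the orientation makes:

info = {'a.txt': {'tokens': 10, 'types': 8},
        'b.txt': {'tokens': 20, 'types': 12}}
pd.DataFrame.from_dict(info, orient='index')   # rows a.txt and b.txt; columns tokens and types
pd.DataFrame.from_dict(info)                   # default orientation: a.txt and b.txt become columns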
You can print the contents of the dataframe just by calling the variable df. Note the difference in the output when you use print(df). It’s just less pretty, because it’s text-only.
df = pd.DataFrame.from_dict(files_in_dir_info, orient='index')
df
print(df)
Operations on the df
Now you can also do operations on this dataframe. For instance, you can calculate the lexical diversity (types/tokens) and add a new column with that information.
The first part of the cell, df['lex_div'], creates a new column, “lex_div”, in the dataframe. The square brackets indicate that we are dealing with a part of the df. Then, we use information from other columns in the df, namely ‘types’ and ‘tokens’, to populate the ‘lex_div’ column, one row at a time.
This is why I didn’t do this calculation as I was reading the files, in the get_text_info() function. I can just do it on the dataframe.
df['lex_div'] = df['types']/df['tokens']
df
Calculations on the df
You can calculate things from the dataframe, like the average number of tokens or types for the entire corpus. These calculations don’t change the dataframe; you simply get information from it.
print(df["tokens"].mean())print(df["types"].mean())print(df["lex_div"].mean())dfSave to csv¶
One of the most useful things about pandas is that you can save the information to a csv file directly, using the columns and rows you already have. You can also read in a csv file and convert it to a dataframe. We are going to save all the information from the dataframe into a “corpus_info.csv” file, with the index (the first column) called ‘file’.
After you run the first cell below, go to your Jupyter directory and open the csv file. Inspect the contents.
Then, the second cell reads that information into a new dataframe, df_new. This is not very useful in this notebook, as you already have the original df, but it shows you how to read in an existing csv.
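One caveat: read back this way, ‘file’ comes in as a regular column rather than as the row index. If you want the file names back as the index, you can pass index_col to read_csv (a variation, not used in the cells below; df_back is just an illustrative name):

df_back = pd.read_csv("./data/corpus_info.csv", index_col='file')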
# save df to a csv file
df.to_csv("./data/corpus_info.csv", index=True, index_label='file')
# read in that csv file into a new variable
df_new = pd.read_csv("./data/corpus_info.csv")
df_new
A few useful things you can do on a dataframe
There are all kinds of good things you can do with a dataframe. For instance, you can sort it by the values in one of the columns. The first two cells sort by file name in alphabetical order, regular and reverse. The next sorts by number of tokens.
You can also drop one of the columns, or add an empty column to add in later.
The head() method gives you the first 5 rows. You can also give it a number n to get the first n rows; for instance, df.head(2) gives you the first 2 rows.
# sort by file in alphabetical order
df_new = df_new.sort_values('file')
df_new
# sort by file in descending alphabetical order
df_new = df_new.sort_values('file', ascending=False)
df_new
# sort by tokens in ascending order
df_new = df_new.sort_values('tokens')
df_new
# drop the types column
df_new = df_new.drop(['types'], axis=1)
df_new
# add an empty column
df_new['new_column'] = ''
df_new
# see the first 2 rows
df_new.head(2)
Summary
We have learned to use the pandas library to create dataframes, structured tables of information that we can manipulate.
We have also learned to write a dataframe to a csv file, and to read a csv file into a dataframe.