
Unit 2 - Text as data

We are interested in text data as a source of information. This is usually called ‘unstructured’ data, because it doesn’t come in a nice table or summary. Computational processing of text means turning language data into structures that we can process and analyze. That’s what we are doing in this course!

But before we do any analysis, we need to think about what data we are analyzing:

  • Where it comes from

  • Who produced it

  • What it contains

  • How it represents the world

  • How it’ll be used in the future

SFU's Library provides a set of General principles of ethical behaviour in research practices and for data management. The main concerns are anonymity and privacy.

Anonymity

Data collected from human subjects should be anonymized, to avoid any potential disclosure of confidential information.

  • When recording/collecting

    • Keep all personal information separate from the data

    • Keep personal information secure

    • Obey rules about how long to store such information

  • When distributing data

    • Remove any personal identifiers

    • If identifiers are needed, use pseudonyms

Privacy

Even if your data are anonymous, you can still disclose private details about an individual or a third party.

  • You may have removed the name, but left other identifying information

  • Data triangulation and aggregation are serious issues. By overlapping information from different databases, you can reveal somebody’s identity.

Finding and saving data

There are lots of very good introductions to how to find and save data. I recommend the section on Data collection from Melanie Walsh’s course “Introduction to Cultural Analytics & Python”. Real Python also has a Practical introduction to web scraping.

We’ll use some of the information there to scrape web data. We are scraping a publicly-available website, so there are no anonymity, privacy, or copyright issues.

Scraping Scifiscripts

We will be using a module called requests to do this. Modules or libraries are collections of functions and utilities already built in Python. To use them, you first have to import them. It’s good practice to import all the modules at the beginning of your code.

requests has a function get() that fetches a page. Let’s say we want to get the script for the movie Ghostbusters. Go to the Scifiscripts webpage and check what’s there. We are going to use the URL to get the text of the page.

We first collect the response from querying the URL into the variable response. Then, we use the text attribute to store the text associated with that URL in the variable html_string.

If you look at that variable, you’ll see that it has lots of escape codes like \r and \n. But you can also get it to print like a web page with print().
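As a small illustration, compare the raw representation of a string with its printed output (the snippet below is a made-up stand-in for html_string, not the actual script text):

```python
# a made-up stand-in for the scraped text
snippet = "GHOSTBUSTERS\r\n\r\nA screenplay\r\n"

# repr() shows the raw escape codes \r and \n
print(repr(snippet))

# print() interprets them as line breaks
print(snippet)
```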

Finally, we will save that text to a file to use later. You should have a directory called data/ where this notebook is. Then simply give Python the path to that directory and put the file there. Navigate to the directory (remember, you can use pwd to find it) and check that Ghostbusters.txt is there.

# import the module that we need
import requests

# use the get() function and store the value in the variable 'response'
response = requests.get("http://www.scifiscripts.com/scripts/Ghostbusters.txt")

# store the text of the page
html_string = response.text

# display the raw string, then print it as formatted text
html_string
print(html_string)

# save the text to a file
with open("./data/Ghostbusters.txt", 'w') as out:
    out.write(html_string)

Navigate (using Mac Finder or Windows Explorer) to the directory printed below and check that Ghostbusters.txt is under the directory data/.

pwd
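pwd is a notebook convenience; in plain Python you can do the same with the os module. A minimal sketch (the file check assumes you already ran the scraping cell above; it simply prints False otherwise):

```python
import os

# current working directory, like the notebook's pwd
cwd = os.getcwd()
print(cwd)

# build the expected path and check whether the saved file is there
path = os.path.join(cwd, "data", "Ghostbusters.txt")
print(os.path.exists(path))
```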

Note that if your file contains characters beyond basic ASCII, you’ll also want to indicate the encoding explicitly. For instance, if the file is in utf-8, you’d say:

with open("./data/Ghostbusters.txt", 'w', encoding="utf-8") as out:
    out.write(html_string)
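To convince yourself that writing and reading with an explicit encoding round-trips correctly, here is a minimal sketch using a made-up sample string and a throwaway filename (sample.txt, not the movie script):

```python
# a short made-up sample
sample = "Who you gonna call?\nGhostbusters!\n"

# write it out with utf-8 encoding
with open("sample.txt", "w", encoding="utf-8") as out:
    out.write(sample)

# read it back with the same encoding
with open("sample.txt", "r", encoding="utf-8") as infile:
    text = infile.read()

print(text == sample)  # → True
```

Using the same encoding for writing and reading is what guarantees the round trip; mixing encodings is a common source of garbled characters.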