Unit 2 - Text as data
We are interested in text data as a source of information. This is usually called ‘unstructured’ data, because it doesn’t come in a nice table or summary. Computational processing of text means turning language data into structures that we can process and analyze. That’s what we are doing in this course!
But before we do any analysis, we need to think about the data we are analyzing:
Where it comes from
Who produced it
What it contains
How it represents the world
How it’ll be used in the future
Through the Library, SFU provides a set of general principles of ethical behaviour in research practices and for data management. The main concerns are anonymity and privacy.
Anonymity
Data collected from human subjects should be anonymized, to avoid any potential disclosure of confidential information.
When recording/collecting
Keep all personal information separate from the data
Keep personal information secure
Obey rules about how long to store such information
When distributing data
Remove any personal identifiers
If identifiers are needed, use pseudonyms
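For instance, replacing names with pseudonyms can be done programmatically. A minimal sketch (the names, pseudonyms, and transcript below are invented for illustration):

```python
# Map each real name to a pseudonym. In practice, store this mapping
# in a separate, secure file, never alongside the data itself.
pseudonyms = {"Alice Chen": "Participant_01", "Raj Patel": "Participant_02"}

transcript = "Alice Chen said she often emails Raj Patel about the survey."

# Replace every occurrence of each name with its pseudonym
for name, pseudonym in pseudonyms.items():
    transcript = transcript.replace(name, pseudonym)

print(transcript)
```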
Privacy
You can make your data anonymous but still disclose private details about an individual or a third party.
You may have removed the name, but left other identifying information
Data triangulation and aggregation are serious issues. By overlapping information from different databases, you can reveal somebody’s identity.
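To make the risk concrete, here is a minimal sketch of triangulation; every name, record, and field in it is invented for illustration. Two separately "anonymized" releases can re-identify a person when joined on shared fields:

```python
# A health survey with names removed...
survey = [
    {"id": "P01", "zip": "V5A", "birth_year": 1991, "diagnosis": "asthma"},
]
# ...and a public voter roll with names but no health data
voter_roll = [
    {"name": "J. Smith", "zip": "V5A", "birth_year": 1991},
]

# Overlapping the two datasets on zip code and birth year
# links the name back to the "anonymous" diagnosis
matches = [
    (v["name"], s["diagnosis"])
    for s in survey
    for v in voter_roll
    if s["zip"] == v["zip"] and s["birth_year"] == v["birth_year"]
]

print(matches)
```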
Finding and saving data
There are lots of very good introductions to how to find and save data. I recommend the section on Data collection from Melanie Walsh’s course “Introduction to Cultural Analytics & Python”. Real Python also has a Practical introduction to web scraping.
We’ll use some of the information there to scrape web data. We are scraping a publicly-available website, so there are no anonymity, privacy, or copyright issues.
Scraping Scifiscripts
We will be using a module called requests to do this. Modules (or libraries) are collections of functions and utilities already written in Python. To use them, you first have to import them. It’s good practice to import all the modules you need at the beginning of your code.
requests has a function get() that fetches a web page. Let’s say we want to get the script for the movie Ghostbusters. Go to the Scifiscripts webpage and check what’s there. We are going to use the URL to get the text of the page.
We first collect the response from querying the URL into the variable response. Then, we use the .text attribute to store the text associated with that URL in the variable html_string.
If you look at that variable, you’ll see that it contains lots of escape characters like \r and \n. But you can also get it to display like a web page with print().
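To see the difference, here is a short stand-in string (invented for illustration, not the actual script) containing the same escape characters:

```python
# A stand-in string with the \r\n escapes you'll see in html_string
sample = "GHOSTBUSTERS\r\nby Dan Aykroyd and Harold Ramis\r\n"

# repr() shows the raw escapes, the way the notebook does
# when you evaluate the variable on its own
print(repr(sample))

# print() interprets them, so the string displays as separate lines
print(sample)
```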
Finally, we will save that file to use later. You should have a directory called data/ where this notebook is. Then simply give Python the path to that directory and put the file there. Navigate to the directory (remember, you can use pwd to find it) and check that Ghostbusters.txt is there.
# import the module that we need
import requests

# use the get() function and store the value in the variable 'response'
response = requests.get("http://www.scifiscripts.com/scripts/Ghostbusters.txt")

# store the html
html_string = response.text

# look at the raw string
html_string

# print it formatted, like a web page
print(html_string)

# open a file and write the text to it
with open("./data/Ghostbusters.txt", 'w') as out:
    out.write(html_string)

Navigate (using Mac Finder or Windows Explorer) to the directory printed below and check that Ghostbusters.txt is under the directory data/.
pwd

Note that if your file uses an encoding other than your system’s default, you’ll also have to specify it. For instance, if the file is in UTF-8, you’d say:
with open("./data/Ghostbusters.txt", 'w', encoding="utf-8") as out:
out.write(html_string)
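Later, you can read the saved file back in the same way. A minimal sketch, using a short stand-in string in place of the downloaded script so it runs on its own:

```python
import os

# make sure the data/ directory exists
os.makedirs("./data", exist_ok=True)

# stand-in for the html_string downloaded above
html_string = "GHOSTBUSTERS\r\nby Dan Aykroyd and Harold Ramis\r\n"

# write the file, as in the example above
with open("./data/Ghostbusters.txt", 'w', encoding="utf-8") as out:
    out.write(html_string)

# read it back; specify the same encoding it was written with
with open("./data/Ghostbusters.txt", 'r', encoding="utf-8") as f:
    script = f.read()

print(script[:200])  # preview the first 200 characters
```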