Installing spaCy - Python Text Analysis

spaCy is a set of tools for natural language processing (NLP). For more information, see the spaCy website (https://spacy.io).

This notebook will help you install it. You can also follow the installation instructions on the spaCy website (https://spacy.io/usage) to find the right version for your system.

Two ways of installing spaCy

1. Jupyter notebook

You can install spaCy from within the notebook by running the two install commands in the section “Installing spaCy and language model” further down (only if you are running this file locally). You only have to do this once.
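When installing from a notebook, it can help to make sure pip targets the same Python interpreter that the notebook kernel is running. The following is a minimal sketch of one way to do that, not the only approach; `pip_install_command` and `install` are hypothetical helper names introduced here for illustration.

```python
import subprocess
import sys


def pip_install_command(package):
    """Build a pip command that targets the interpreter running this notebook."""
    return [sys.executable, "-m", "pip", "install", package]


def install(package):
    """Run the install; raises CalledProcessError if pip fails."""
    subprocess.check_call(pip_install_command(package))


# Show the command without running it; call install("spacy") to actually install.
print(" ".join(pip_install_command("spacy")))
```

Calling `install("spacy")` runs the printed command. In IPython-based notebooks, the `%pip install spacy` magic serves the same purpose.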

2. Command prompt

On Windows, if you have Anaconda, you can open an Anaconda PowerShell Prompt. If you don’t have Anaconda, open Windows PowerShell in administrator mode.
On a Mac, open a terminal window (open Spotlight and type “terminal”), or look for “Terminal” in your Applications folder.

With a command window open, go to https://spacy.io/usage, select your operating system and the other options that apply to you, and copy and paste the resulting commands into the command window, one at a time.

Compatibility problems

If you get an error that mentions numpy, try one of the three options below.

Possible error messages:

numpy.ndarray ...
numpy.dtype ...

1. Follow instructions on the spaCy site

Go to the heading “Using build constraints when compiling from source” in https://spacy.io/usage. In a command prompt/terminal, type the two lines (one at a time) that start with PIP_CONSTRAINT.

2. Downgrade numpy

Type one of the two commands below, either in a notebook or in terminal/command prompt:

In notebook: !pip install numpy==1.26.4
In command prompt: pip install numpy==1.26.4
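Before and after downgrading, it can be useful to confirm which numpy version is actually installed in the environment the notebook uses. The sketch below relies only on the standard library; `installed_version` and `major_version` are hypothetical helper names, not part of numpy or spaCy.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string of a package, or None if it is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


def major_version(version_string):
    """Extract the leading major number, e.g. '1.26.4' -> 1."""
    return int(version_string.split(".")[0])


numpy_version = installed_version("numpy")
if numpy_version is None:
    print("numpy is not installed in this environment")
else:
    print("numpy", numpy_version, "has major version", major_version(numpy_version))
```

After running the downgrade command above, the reported version should be 1.26.4. If you ran the check in a notebook, restart the kernel first so that the new version is picked up.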

3. Run in the LING-ENV setup

In Unit 0, find the page that describes how to set up a virtual environment on your system, and follow the steps there.

Installing spaCy and language model

If you are running this notebook locally, you only have to run the next two lines once.

!pip install spacy

!python -m spacy download en_core_web_sm
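Since the two commands above only need to run once, a quick check can tell you whether they have already been run in the current environment. The downloaded model is installed as a regular Python package named `en_core_web_sm`, so the standard library can look for it. This is a sketch; `is_installed` is a hypothetical helper name.

```python
from importlib.util import find_spec


def is_installed(module_name):
    """True if a top-level module or package can be found in this environment."""
    return find_spec(module_name) is not None


for name in ("spacy", "en_core_web_sm"):
    status = "installed" if is_installed(name) else "missing"
    print(name + ":", status)
```

If either line reports "missing", run the corresponding install command above.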

Loading spaCy and language model

Installation (if local) only needs to be done once. However, you need to import the spaCy module and load the language model every time you want to use it.

Here, we are loading the small model for English derived from web data. There are other models for English and for other languages.

import spacy

nlp = spacy.load("en_core_web_sm")

Testing installation

We’ll define a sentence, process it with spaCy and check the output. This will test whether all the components are installed.

sentence = "This is a test sentence about Canada, but you can type whatever you want here."

Converting string to doc with spaCy

spaCy has a special type of object, a Doc. A Doc holds the results of the entire processing pipeline in a single object. Calling nlp() on a text, e.g., sentence, applies all the NLP steps to it (tokenization, tagging, parsing, named entity recognition) and returns a Doc. Once you have converted a string (a sentence) or a whole text to a Doc, you can access everything that spaCy has done with it, i.e., all the language information it has assigned, with labels. spaCy refers to that information and those labels as ‘linguistic annotations’. All of this happens in a single call, nlp().

Image from https://spacy.io/usage/processing-pipelines

doc = nlp(sentence)
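To see which processing steps were applied and what came back, you can inspect the pipeline and the resulting object. The sketch below assumes `nlp` is a loaded spaCy pipeline, such as the one loaded earlier; `describe_processing` is a hypothetical helper name, not part of spaCy.

```python
def describe_processing(nlp, text):
    """Run text through a loaded pipeline and summarize the result."""
    doc = nlp(text)
    return {
        "components": list(nlp.pipe_names),  # processing steps, in order
        "result_type": type(doc).__name__,   # "Doc" for a real spaCy pipeline
        "n_tokens": len(doc),                # a Doc is a sequence of tokens
    }
```

With the cells above already run, `describe_processing(nlp, sentence)` returns the summary for the test sentence. The tokenizer always runs first and is not listed in `nlp.pipe_names`.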

Accessing the information in the Doc object

doc contains lots of useful information:

tokens (words)
lemmas
morphology
part of speech tags (pos tags)
syntactic structure (a parse tree)
named entities

# print word tokens

for token in doc:
    print(token)

# lemmas

for token in doc:
    print(token.lemma_)

# morphology

for token in doc:
    print(token.text, token.morph)

# POS tags (more on this below)

for token in doc:
    print(token.text, token.pos_)

# named entities

for ent in doc.ents:
    print(ent.text, ent.label_)
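The cells above cover every item in the list except the syntactic structure. In spaCy, each token stores its dependency relation in `token.dep_` and the word it attaches to in `token.head`. The sketch below collects these alongside the other annotations; `token_table` is a hypothetical helper name, and it assumes `doc` comes from a pipeline that includes a parser, such as en_core_web_sm.

```python
def token_table(doc):
    """Return one (text, lemma, POS, dependency, head) tuple per token in a Doc."""
    return [
        (token.text, token.lemma_, token.pos_, token.dep_, token.head.text)
        for token in doc
    ]


# With `doc = nlp(sentence)` from the cells above:
# for row in token_table(doc):
#     print(*row, sep="\t")
```

The token labelled ROOT is the head of the whole sentence, and its own head is itself.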