Mastering NLP with spaCy – Part 3


Knowing how to use spaCy rules to discover patterns in text can be crucial. Entities like times, dates, IBANs and emails follow a strict structure, so it is possible to find them with deterministic rules, for instance with regular expressions (regexes).

spaCy simplifies the use of regexes by making them more human-readable: instead of cryptic symbols, you write explicit descriptions using the Matcher class.

Token-based matching

A regex is a sequence of characters that specifies a search pattern. Python has a built-in library for working with regexes called re: https://docs.python.org/3/library/re.html

Let’s see an example.

"Marcello Politi"
"Marcello   Politi"
"Marcello Danilo Politi"

reg = r"Marcello\s+(Danilo\s+)?Politi"

In this example, the reg pattern captures all of the previous strings. The pattern says that “Marcello” may optionally be followed by the word “Danilo” (thanks to the “?” quantifier). The “\s+” says that it doesn’t matter whether the words are separated by a single space, a tab or multiple spaces.
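To make this concrete, here is a minimal sketch using Python’s built-in re module, run against the strings above:

```python
import re

# \s+ matches one or more whitespace characters (space, tab, ...);
# (Danilo\s+)? makes the middle name optional
pattern = re.compile(r"Marcello\s+(Danilo\s+)?Politi")

names = ["Marcello Politi", "Marcello   Politi", "Marcello Danilo Politi"]
for name in names:
    print(name, "->", bool(pattern.fullmatch(name)))  # True for all three
```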

The issue with regexes, and the reason why many programmers don’t love them, is that they’re difficult to read. For this reason spaCy provides a clean, production-level alternative with the Matcher class.

Let’s import the class and see how we can use it. (I’ll explain what Span is later.)

import spacy
from spacy.matcher import Matcher
from spacy.tokens import Span
nlp = spacy.load("en_core_web_sm")

Now we can define a pattern that matches some morning greetings, and we label this pattern “morningGreeting”. Defining a pattern with Matcher is easy. In this pattern, we expect a word that, when converted to lower case, matches “good”, then the same for “morning”, and then we accept some punctuation at the end.

matcher = Matcher(nlp.vocab)
pattern = [
    {"LOWER": "good"},
    {"LOWER": "morning"},
    {"IS_PUNCT": True},
]
matcher.add("morningGreeting", [pattern])

A Span is a slice of a Doc, a contiguous sequence of tokens. The Matcher returns the start and end token indices of each match, which we iterate over with a for loop and turn into Span objects.
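As a quick illustration (a sketch using a blank English pipeline, so no model download is needed), a Span can also be built by hand from token indices:

```python
import spacy
from spacy.tokens import Span

nlp = spacy.blank("en")  # tokenizer only; enough for building spans by hand
doc = nlp("Good morning, my name is Marcello!")

# tokens 0 and 1 are "Good" and "morning"; the end index is exclusive
span = Span(doc, 0, 2, label="greeting")
print(span.text)             # Good morning
print(span.start, span.end)  # 0 2
```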

We add all of the spans to a list and assign the list to doc.spans["sc"]. Then we can use displacy to visualise the spans.

doc = nlp("Good morning, My name is Marcello Politi!")
matches = matcher(doc)
spans = []

for match_id, start, end in matches:
  spans.append(
      Span(doc, start, end, nlp.vocab.strings[match_id])
  )

doc.spans["sc"] = spans
from spacy import displacy

displacy.render(doc, style = "span")

A Matcher object accepts multiple patterns at a time!
Let’s define a morningGreeting and an eveningGreeting.

pattern1 = [
    {"LOWER": "good"},
    {"LOWER": "morning"},
    {"IS_PUNCT": True},
]

pattern2 = [
    {"LOWER": "good"},
    {"LOWER": "evening"},
    {"IS_PUNCT": True},
]

Then we add these patterns to the Matcher.

doc = nlp("Good morning, I need to attend the lecture. I'll then say good evening!")
matcher = Matcher(nlp.vocab)

matcher.add("morningGreetings", [pattern1])
matcher.add("eveningGreetings", [pattern2])

matches = matcher(doc)

As before, we iterate over the matches and display the spans.

spans = []

for match_id, start, end in matches:
  spans.append(
      Span(doc, start, end, nlp.vocab.strings[match_id])
  )

doc.spans["sc"] = spans
from spacy import displacy

displacy.render(doc, style = "span")

The syntax supported by spaCy is extensive. Here are some of the most common patterns.

Text-based attributes

Attribute Description Example
"ORTH" Exact verbatim text {"ORTH": "Hello"}
"LOWER" Lowercase form of the token {"LOWER": "hello"}
"TEXT" Same as "ORTH" {"TEXT": "World"}
"LEMMA" Lemma (base form) of the token {"LEMMA": "run"}
"SHAPE" Shape of the word (e.g., Xxxx, dd) {"SHAPE": "Xxxx"}
"PREFIX" First character(s) of the token {"PREFIX": "un"}
"SUFFIX" Last character(s) of the token {"SUFFIX": "ing"}

Linguistic features

Attribute Description Example
"POS" Universal POS tag {"POS": "NOUN"}
"TAG" Detailed POS tag {"TAG": "NN"}
"DEP" Syntactic dependency {"DEP": "nsubj"}
"ENT_TYPE" Named entity type {"ENT_TYPE": "PERSON"}

Boolean flags

Attribute Description Example
"IS_ALPHA" Token consists of alphabetic chars {"IS_ALPHA": True}
"IS_ASCII" Token consists of ASCII characters {"IS_ASCII": True}
"IS_DIGIT" Token is a digit {"IS_DIGIT": True}
"IS_LOWER" Token is lowercase {"IS_LOWER": True}
"IS_UPPER" Token is uppercase {"IS_UPPER": True}
"IS_TITLE" Token is in title case {"IS_TITLE": True}
"IS_PUNCT" Token is punctuation {"IS_PUNCT": True}
"IS_SPACE" Token is whitespace {"IS_SPACE": True}
"IS_STOP" Token is a stop word {"IS_STOP": True}
"LIKE_NUM" Token looks like a number {"LIKE_NUM": True}
"LIKE_EMAIL" Token looks like an email address {"LIKE_EMAIL": True}
"LIKE_URL" Token looks like a URL {"LIKE_URL": True}
"IS_SENT_START" Token is at sentence start {"IS_SENT_START": True}
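A short sketch of the LIKE_ flags, again with a blank pipeline (the email address is made up for illustration):

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)
matcher.add("email", [[{"LIKE_EMAIL": True}]])   # token looks like an email
matcher.add("number", [[{"LIKE_NUM": True}]])    # digits or number words

doc = nlp("Send five reports to john.doe@example.com")
for match_id, start, end in matcher(doc):
    print(nlp.vocab.strings[match_id], doc[start:end].text)
```

Note that LIKE_NUM also catches spelled-out numbers like “five”, not just digits.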

Operators

Used to repeat or make patterns optional:

Operator ("OP" value) Description Example
"?" zero or one {"LOWER": "is", "OP": "?"}
"*" zero or more {"IS_DIGIT": True, "OP": "*"}
"+" one or more {"IS_ALPHA": True, "OP": "+"}
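For example, a sketch of the "*" operator with a blank pipeline:

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)

# "good" preceded by zero or more occurrences of "very"
pattern = [{"LOWER": "very", "OP": "*"}, {"LOWER": "good"}]
matcher.add("praise", [pattern])

doc = nlp("This is very very good")
for match_id, start, end in matcher(doc):
    print(doc[start:end].text)
```

The Matcher returns every candidate match (“good”, “very good”, “very very good”); pass greedy="LONGEST" to matcher.add if you only want the longest one.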

Example:

What’s a pattern that matches a sentence with the following structure?

Pattern Requirements:

  • Subject pronoun (e.g., “I”, “we”, “they”)
  • A verb (e.g., “have”, “bought”, “found”)
  • A number (digit or written, like “2”, “five”)
  • An optional adjective (e.g., “red”, “ripe”)
  • A plural noun (fruit, for instance)
pattern = [
    {"POS": "PRON"},                               # Subject pronoun: I, we, they
    {"POS": "VERB"},                               # Verb: have, bought, found
    {"LIKE_NUM": True},                            # Number: 2, five
    {"POS": "ADJ", "OP": "?"},                     # Optional adjective: red, ripe
    {"POS": "NOUN", "TAG": "NNS"}                  # Plural noun: apples, bananas
]

Patterns with PhraseMatcher

When we work in a vertical domain, like medicine or science, we often have a set of terms that spaCy may not be aware of, and we want to find them in text.

The PhraseMatcher class is the spaCy solution for comparing text against long dictionaries. Its usage is quite similar to the Matcher class, but in addition we need to define the list of terms we want to track. Let’s start with the imports.

import spacy
from spacy.tokens import Span
from spacy.matcher import PhraseMatcher
from spacy import displacy
nlp = spacy.load("en_core_web_sm")

Now we define our matcher and our list of words, and tell spaCy to create a pattern simply to recognise that list. Here, I want to find the names of tech leaders and places.

terms = ["Sundar Pichai", "Tim Cook", "Silicon Valley"]
matcher = PhraseMatcher(nlp.vocab)
patterns = [nlp.make_doc(text) for text in terms]
matcher.add("TechLeadersAndPlaces", patterns)

Finally check the matches.

doc = nlp("Tech CEOs like Sundar Pichai and Tim Cook met in Silicon Valley to debate AI regulation.")
matches = matcher(doc)
spans= []

for match_id, start, end in matches:
  pattern_name = nlp.vocab.strings[match_id]
  spans.append(Span(doc, start, end, pattern_name))

doc.spans["sc"] = spans
displacy.render(doc, style = "span")

We can enhance the capabilities of the PhraseMatcher by adding some attributes. For instance, if we need to catch IP addresses in a text, perhaps in some logs, we cannot write out all possible combinations of IP addresses; that would be crazy. But we can ask spaCy to capture the shape of some IP strings and check for the same shape in a text.

matcher = PhraseMatcher(nlp.vocab, attr= "SHAPE")

ips  = ["127.0.0.0", "127.256.0.0"]
patterns = [nlp.make_doc(ip) for ip in ips]
matcher.add("IP-pattern", patterns)
doc = nlp("This fastAPI server can run on 192.1.1.1 or on 192.170.1.1")
matches = matcher(doc)
spans= []

for match_id, start, end in matches:
  pattern_name = nlp.vocab.strings[match_id]
  spans.append(Span(doc, start, end, pattern_name))

doc.spans["sc"] = spans
displacy.render(doc, style = "span")

IBAN Extraction

The IBAN is an important piece of information that we often need to extract when working in the financial field, for instance when analysing invoices or transactions. But how can we do that?

Each IBAN has a fixed international format, starting with two letters that identify the country.

We know that every IBAN starts with two capital letters XX followed by two check digits dd. So we can write a pattern to identify this first part of the IBAN.

{"SHAPE":"XXdd"}

We’re not done yet. Each of the remaining blocks can have from 1 to 4 digits, which we can express with the regex \d{1,4}.

{"TEXT": {"REGEX": "\d{1,4}"}}

We can have one or more of those blocks, so we use the “+” operator to capture all of them.

{"TEXT": {"REGEX": "\d{1,4}"}, "OP": "+"}

Now we can combine the shape of the first block with the digit blocks.

pattern = [
    {"SHAPE": "XXdd"},
    {"TEXT": {"REGEX": "\d{1,4}"}, "OP": "+"},
]

matcher = Matcher(nlp.vocab)
matcher.add("IBAN", [pattern])

Now let’s use this!

text = "Please transfer the money to the following account: DE44 5001 0517 5407 3249 31 by Monday."
doc = nlp(text)

matches = matcher(doc)
spans = []

for match_id, start, end in matches:
    span = Span(doc, start, end, label=nlp.vocab.strings[match_id])
    spans.append(span)

doc.spans["sc"] = spans
displacy.render(doc, style="span")

Final Thoughts

I hope this article helped you see how much we can do in NLP without always reaching for huge models. Very often, we just need to find things that follow rules, like dates, IBANs, names or greetings, and for that, spaCy gives us great tools like Matcher and PhraseMatcher.

In my opinion, working with patterns like these is a great way to better understand how text is structured. It also makes your work more efficient when you don’t want to waste resources on something simple.

I still think regex is powerful, but sometimes hard to read and debug. With spaCy, things look clearer and are easier to maintain in a real project.
