Use Regex Patterns in Pandas to Work With Complex Strings

Photo by Chris Moore on Unsplash

Regex is probably the most potent technique to clean and extract data. If you've ever worked with a big text dataset, you'd know how time-consuming and energy-draining it can be.

I often use regex to clean phone numbers and emails and to standardize addresses. But there are complex use cases as well.

We noticed an inconsistent office column in our recent data pipeline from a particular data source. We only needed the office code from this column: two or three letters followed by a colon and a two-digit number. Earlier, we used a simple replace operation to map the column to our desired values. But as new data broke that assumption, we had to change the strategy. Since we can be sure the pattern is consistent, we used regex to clean the column. This way, we never have to worry about changing column values.
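For illustration, here's a minimal sketch of that kind of cleanup (the codes and the column are made up; the real pipeline isn't shown here):

import pandas as pd

# hypothetical office codes following the described pattern
office = pd.Series(["NY:01", "LDN:12", "SF:07"])

# two or three letters, a colon, then a two-digit number
office.str.extract(r"([A-Za-z]{2,3}):(\d{2})")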

If your dataset is significantly large and you need the extracted values stored in new columns next to each row, you might be tempted to use the map or apply methods in Pandas. But Pandas natively provides excellent APIs for string operations, and nearly all of them support regex.

# With apply
import re
df.office.apply(lambda x: re.search(r'\d+', x).group(0))

# With native string operation
df.office.str.extract(r'(\d+)')
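One behavioral difference is easy to miss: apply returns a Series here, while str.extract returns a DataFrame by default, with one column per capture group. If you want a Series back from a single-group pattern, you can pass expand=False (assuming, as above, that df.office holds strings containing digits):

# returns a Series rather than a one-column DataFrame
df.office.str.extract(r'(\d+)', expand=False)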

Before we discuss the benefits of these native APIs over the map/apply method, here's what I mean.

Comparing Pandas string extract with the usual map/apply approach

The following code generates a synthetic dataset using the Faker package. It creates 100K fake addresses and stores them in a Pandas Series. You can adjust the size by changing n to whatever value your computer can support.

import pandas as pd
from faker import Faker

faker = Faker()

n = 100000

address = pd.Series([faker.address() for _ in range(n)])

"""
| | address |
|---:|:-----------------------------|
| 0 | 548 Small Parkways Suite 832 |
| | South Brianborough, DC 50474 |
| 1 | 94291 Jerry Pass Suite 992 |
| | East Rebecca, PR 87881 |
| 2 | 3973 Sensible Spring |
| | Grantfort, AS 52002 |
| 3 | 62589 David Island |
| | East Kathleenville, OH 45208 |
| 4 | 0415 Jimenez Hill Apt. 642 |
| | Gambleland, WA 99356 |
"""

Our goal is to extract the state and zip codes into separate columns. This task is straightforward enough to do in spreadsheet software. But let's keep it for our discussion, and let's assume we're using regex.

Here is my usual way of mapping or applying a regex to the series.

import re

(
    address
    .map(lambda x: re.search(r"\w{2} \d{5}", x).group())
    .str.split(" ", expand=True)
    .rename(columns={0: "state", 1: "zip_code"})
)

"""
| | state | zip_code |
|---:|:--------|-----------:|
| 0 | DC | 50474 |
| 1 | PR | 87881 |
| 2 | AS | 52002 |
| 3 | OH | 45208 |
| 4 | WA | 99356 |
"""

The code above is straightforward to understand. We match two letters followed by a whitespace and five digits. Then we do a string split and expand it into separate columns. Finally, we name the columns 'state' and 'zip_code.'

But here's the Pandas way of doing it.

address.str.extract(r"(?P<state>\w{2}) (?P<zip_code>\d{5})")

"""
| | state | zip_code |
|---:|:--------|-----------:|
| 0 | DC | 50474 |
| 1 | PR | 87881 |
| 2 | AS | 52002 |
| 3 | OH | 45208 |
| 4 | WA | 99356 |
"""

This is unarguably more elegant than the previous code. We use named groups in our pattern, and the group names become the column names.

On a separate note, you can make part of a pattern a group by wrapping it in parentheses. You can name each group by adding ?P<group_name> right after the opening parenthesis, before you start describing the pattern.
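Here's a tiny, self-contained illustration of the syntax with plain re (the string is made up):

import re

# name the two capture groups "code" and "num"
m = re.search(r"(?P<code>[A-Z]{2,3}):(?P<num>\d{2})", "office NY:01")
print(m.group("code"), m.group("num"))  # prints: NY 01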

Okay, the native method is great in readability. But what about performance?

I've used the timeit utility in a Jupyter notebook to record the execution times. The native way doesn't have the upper hand on performance; mapping is quicker.

But a single map call doesn't finish our desired output; we need additional steps to get there. The whole set of operations costs slightly more time than the extract method.

Comparing the map/apply method performance against the extract method — Screenshot by the author.
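If you'd like to reproduce the comparison outside a notebook, here's a rough sketch using the standard timeit module instead of the %%timeit magic (it assumes the address Series from earlier is in scope; the numbers will vary by machine and dataset):

import re
import timeit

def with_map():
    return (
        address
        .map(lambda x: re.search(r"\w{2} \d{5}", x).group())
        .str.split(" ", expand=True)
        .rename(columns={0: "state", 1: "zip_code"})
    )

def with_extract():
    return address.str.extract(r"(?P<state>\w{2}) (?P<zip_code>\d{5})")

print("map + split + rename:", timeit.timeit(with_map, number=10))
print("str.extract:", timeit.timeit(with_extract, number=10))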

Besides readability, the two methods aren't too different. But the difference becomes significant if you're working with an enormous dataset.

Also, if your code runs in a resource-constrained environment, you must decide carefully. That's often my case, as I mostly build data pipelines, and I need to make sure I'm using the optimal code for faster and cheaper processing of live data.

We all know there's more than one way to do data wrangling in Pandas. If you anticipate re-running the scripts in the future, you may have to spend some time experimenting with the alternatives.

Useful Pandas string methods with regex

Now that we know how easy it is to use regex operations directly, without mapping or applying a function, here are some methods I frequently use.

We've already seen one example of using the extract API in the previous section. It's handy with regex patterns; perhaps it's the one I use most.

Here are three other techniques I often use.

Let's suppose a column contains states and zip codes, and we need them separated into individual columns. Since the data is sourced from a free-form input, the separator isn't always a whitespace or a comma.

import pandas as pd

# Create a sample dataframe
df = pd.DataFrame(
{
"location": [
"New York 10001",
"California 90001",
"Texas-75201",
"Georgia 30301",
"Oregon97205",
"Arizona 85001",
"Illinois 60601",
"Florida 33101",
"Ohio 44101",
"Pennsylvania-19104",
]
}
)

df["location"].str.split(r"[s|-]*(d{5})", expand=True)

'''
| | 0 | 1 |
|---:|:-------------|------:|
| 0 | New York | 10001 |
| 1 | California | 90001 |
| 2 | Texas | 75201 |
| 3 | Georgia | 30301 |
| 4 | Oregon | 97205 |
| 5 | Arizona | 85001 |
| 6 | Illinois | 60601 |
| 7 | Florida | 33101 |
| 8 | Ohio | 44101 |
| 9 | Pennsylvania | 19104 |
'''
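By the way, if you'd rather land directly on named columns instead of 0 and 1, the same separation can be expressed with extract and named groups. A sketch (the column names state and zip_code are my own choice):

df["location"].str.extract(r"(?P<state>\D+?)[\s|-]*(?P<zip_code>\d{5})")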

I had a dataset that contained an office serial number. The number had a pattern: the first two letters denote the country's shortcode; a three-digit location code follows the country code; then a hyphen and a department id, which is another three-digit number.

Suppose we need to filter all the records related to the finance department (department id 006) of three countries: the UK, India, and Australia. We can do something like this.

import pandas as pd

# Create a sample dataframe
data = {
"office_serial_number": [
"US101-001",
"UK201-006",
"CA301-003",
"AU401-004",
"UK202-005",
"IN302-006",
"IR102-007",
"AU402-006",
"SL303-009",
"UK203-010",
"FR403-011",
"US103-012",
]
}

df = pd.DataFrame(data)

df[df.office_serial_number.str.contains(r"^(UK|IN|AU)\d{3}-006")]

'''
| | office_serial_number |
|---:|:-----------------------|
| 1 | UK201-006 |
| 5 | IN302-006 |
| 7 | AU402-006 |
'''
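One wrinkle worth knowing: recent Pandas versions emit a UserWarning when str.contains receives a pattern with capture groups, because contains only checks for a match and never returns the groups. Switching to a non-capturing group keeps the behavior identical and silences the warning:

# (?:...) groups without capturing, so Pandas doesn't warn
df[df.office_serial_number.str.contains(r"^(?:UK|IN|AU)\d{3}-006")]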

This could be a tricky task if it weren't for regex. And it wouldn't be a readable one, either.

Replace is a frequent string operation. Even in Excel, we do it a lot. But some replace operations are more complex than a simple find and replace.

We need to find patterns and replace them with new strings.

Take a phone number column, for instance. You need to remove the country codes from the column. Some records have a country code, and some don't. Even the ones with a country code come in different formats.

Here's a simple example of doing this.

import pandas as pd
import re

# create a sample dataframe (dummy)
df = pd.DataFrame({'phone': [
    "+1 555-555-5555", "+44 20 7123 4567", "+81 3-1234-5678",
    "0049 30 12345678", "+61 2 1234 5678", "+33 1 23 45 67 89",
    "+86 10 1234 5678", "011 52 55 1234 5678", "+971 4 123 4567",
    "+49 911 12345678", "(+81) 3-1234-5678",
]})

# a regular expression pattern for the country code: an optional "(",
# an optional "+", one to four digits, an optional ")", then an
# optional space or hyphen
pattern = r'^\(?\+?\d{1,4}\)?[\s-]?'

# apply the pattern to the 'phone' column and replace the matches with an empty string
df['phone'] = df['phone'].apply(lambda x: re.sub(pattern, '', x))

print(df)

                  phone
0          555-555-5555
1          20 7123 4567
2           3-1234-5678
3           30 12345678
4           2 1234 5678
5         1 23 45 67 89
6          10 1234 5678
7       52 55 1234 5678
8            4 123 4567
9          911 12345678
10          3-1234-5678
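And in keeping with the theme of this post, the same substitution doesn't need apply at all; Pandas's native str.replace accepts regex too:

# native equivalent of the re.sub call above
df['phone'] = df['phone'].str.replace(pattern, '', regex=True)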

Conclusion

You'd be fine with your spreadsheet software for basic string manipulations. But for more advanced use cases, a programming language can save hours of your time.

Some operations are complex to handle even with the basic APIs of a programming language, especially the ones where patterns are involved. That's where regex comes to the rescue.

If you're a Pandas user, you can use regex directly in its native APIs. This has the advantage of a cleaner codebase with fewer lines. That's the main focus of this post.

I've discussed some of my favorite regex tricks in Pandas. Although there aren't significant performance improvements, I still prefer these methods because they're straightforward.
