Motivation: NumPy generalizes arithmetic operations to perform the same operation on many array elements
>>> import numpy as np
>>> x = np.array([2, 3, 5, 7, 11, 13])
>>> x * 2
array([4, 6, 10, 14, 22, 26])
NumPy offers no equally simple syntax for arrays of strings, so we fall back on more verbose loop syntax
>>> data = ['Alice', 'Bob', 'Eve']
>>> [s.upper() for s in data]
['ALICE', 'BOB', 'EVE']
This will break if there are missing values
>>> data = ['Alice', 'Bob', None, 'Eve']
>>> [s.upper() for s in data]
Traceback (most recent call last):
  ...
AttributeError: 'NoneType' object has no attribute 'upper'
Pandas provides vectorized string operations that skip over missing data; they are accessed through the str attribute of Series and Index objects
>>> import pandas as pd
>>> names = pd.Series(data)
>>> names.str.upper()
0    ALICE
1      BOB
2     None
3      EVE
dtype: object
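As a rough sketch of what the str accessor saves us from, the block below compares a hand-written None check with the same operation on a Series and on an Index. The variable names are illustrative only, and how the missing entry is displayed (None or NaN) may depend on the pandas version, so no printed output is shown.

```python
import pandas as pd

data = ['Alice', 'Bob', None, 'Eve']

# Plain Python: every element needs an explicit check for None.
upper_manual = [s.upper() if s is not None else None for s in data]

# The str accessor skips missing entries for us, on a Series ...
upper_series = pd.Series(data).str.upper()

# ... and in the same way on an Index.
upper_index = pd.Index(data).str.upper()

print(upper_manual)
print(upper_series)
print(upper_index)
```

In each case the missing entry stays missing instead of raising an exception.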
The following str methods mirror Python's built-in string methods
len | lower | translate | islower |
ljust | upper | startswith | isupper |
rjust | find | endswith | isnumeric |
center | rfind | isalnum | isdecimal |
zfill | index | isalpha | split |
strip | rindex | isdigit | rsplit |
rstrip | capitalize | isspace | partition |
lstrip | swapcase | istitle | rpartition |
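A small sketch of three of the methods listed above, applied to the same names data. The na=False argument to startswith is one way to get a plain Boolean result when the data contains missing values; it is a choice made for this illustration, not a requirement.

```python
import pandas as pd

names = pd.Series(['Alice', 'Bob', None, 'Eve'])

# Same behavior as str.lower, applied to each element.
lowered = names.str.lower()

# Number of characters in each element; the missing entry stays missing.
lengths = names.str.len()

# Boolean test per element; na=False reports the missing entry as False.
starts_with_a = names.str.startswith('A', na=False)

print(lowered)
print(lengths)
print(starts_with_a)
```

Methods that return strings yield a Series of strings, len yields numbers, and the is*/startswith/endswith tests yield Booleans.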
There are several methods that accept regular expressions
Method | Description |
---|---|
match | Call re.match on each element, returning a Boolean |
extract | Call re.search on each element, returning matched groups as strings |
findall | Call re.findall on each element |
replace | Replace occurrences of pattern with some other string |
contains | Call re.search on each element, returning a Boolean |
count | Count occurrences of pattern |
split | Equivalent to str.split, but accepts regexps |
rsplit | Equivalent to str.rsplit, but accepts regexps |
Example: find all names that start and end with a consonant
>>> names.str.findall(r'^[^AEIOU].*[^aeiou]$')
0       []
1    [Bob]
2     None
3       []
dtype: object
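The other regex methods follow the same pattern. The sketch below tries contains, extract, and match on the names data; the patterns are made up for illustration, and the consonant test lists both cases in the character class so it does not depend on capitalization.

```python
import pandas as pd

names = pd.Series(['Alice', 'Bob', None, 'Eve'])

# contains: does the pattern occur anywhere in the element?
# na=False reports the missing entry as False.
starts_with_vowel = names.str.contains(r'^[AEIOU]', na=False)

# extract: return the captured group from the first match.
# expand=False gives a Series because there is a single group.
first_letter = names.str.extract(r'^([A-Za-z])', expand=False)

# match: Boolean version of the consonant test, anchored at the start.
consonant_ends = names.str.match(r'^[^AEIOUaeiou].*[^AEIOUaeiou]$', na=False)

print(starts_with_vowel)
print(first_letter)
print(consonant_ends)
```

Using match or contains instead of findall gives a Boolean mask, which can be used directly to filter the Series.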
Method | Description |
---|---|
get | Index each element |
slice | Slice each element |
slice_replace | Replace slice in each element with passed value |
cat | Concatenate strings |
repeat | Repeat values |
normalize | Return Unicode form of string |
pad | Add whitespace to left, right, or both sides of string |
wrap | Split long strings into lines with length less than a given width |
join | Join strings in each element of the Series with passed separator |
get_dummies | Extract dummy variables as a DataFrame |
Example: slice the first two characters
>>> names.str.slice(0, 2)
0      Al
1      Bo
2    None
3      Ev
dtype: object
Can also be done with Python's normal indexing syntax
>>> names.str[0:2]
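Indexing through str also works on the lists produced by split, which gives a compact way to pull out one piece of each string. The sketch below assumes a hypothetical Series of full names, invented for this illustration and not part of the data above.

```python
import pandas as pd

# Hypothetical full names, made up for this example.
full_names = pd.Series(['Alice Smith', 'Bob Jones', None, 'Eve Adams'])

# split() turns each string into a list of words;
# get(-1) then picks the last item of each list.
last_names = full_names.str.split().str.get(-1)

# Indexing with a single position is equivalent to get(0).
initials = full_names.str[0]

print(last_names)
print(initials)
```

The missing entry propagates through both steps, so no explicit None check is needed.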
The get_dummies method is useful when the data has a column containing some sort of coded indicator
>>> df = pd.DataFrame({
...     'name': names,
...     'info': ['A|C|D', 'B|D', 'A|C', 'B|C|D'],
... })
>>> df
    name   info
0  Alice  A|C|D
1    Bob    B|D
2   None    A|C
3    Eve  B|C|D
Example (continued): split the info column into one indicator column per code
>>> df['info'].str.get_dummies('|')
   A  B  C  D
0  1  0  1  1
1  0  1  0  1
2  1  0  1  0
3  0  1  1  1
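One possible next step, sketched below, is to join the indicator columns back onto the original frame and use them as filters. The names dummies, combined, and has_b are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['Alice', 'Bob', None, 'Eve'],
    'info': ['A|C|D', 'B|D', 'A|C', 'B|C|D'],
})

# One indicator column per code found in the '|'-separated strings.
dummies = df['info'].str.get_dummies('|')

# Attach the indicators to the original rows (aligned on the index).
combined = df.join(dummies)

# Select the names of all rows whose info includes code B.
has_b = combined.loc[combined['B'] == 1, 'name']

print(combined)
print(has_b)
```

Because the indicators are numeric, column sums such as dummies.sum() give a quick count of how often each code occurs.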