Pandas String Operations
Motivation: NumPy generalizes arithmetic operations to perform the same operation on many array elements:
>>> import numpy as np
>>> x = np.array([2, 3, 5, 7, 11, 13])
>>> x * 2
array([ 4,  6, 10, 14, 22, 26])
NumPy does not offer equally simple vectorized access for arrays of strings, so we fall back on more verbose loop syntax:
>>> data = ['Alice', 'Bob', 'Eve']
>>> [s.upper() for s in data]
['ALICE', 'BOB', 'EVE']
This breaks as soon as there are missing values:
>>> data = ['Alice', 'Bob', None, 'Eve']
>>> [s.upper() for s in data]
AttributeError ...
Pandas provides vectorized string operations that correctly handle missing data through the str attribute of Series and Index objects:
>>> import pandas as pd
>>> names = pd.Series(data)
>>> names.str.upper()
0    ALICE
1      BOB
2     None
3      EVE
dtype: object
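As a minimal sketch of what the str attribute saves, the snippet below compares a hand-written loop that guards against missing values with the vectorized call. The helper name `upper_or_none` is hypothetical and exists only for this illustration.

```python
import pandas as pd

data = ['Alice', 'Bob', None, 'Eve']


def upper_or_none(values):
    """Hypothetical helper: upper-case each string, passing None through."""
    return [s.upper() if s is not None else None for s in values]


# Manual approach: every operation needs its own explicit None check.
manual = upper_or_none(data)

# Vectorized approach: the missing entry is skipped automatically.
names = pd.Series(data)
vectorized = names.str.upper()

# The same accessor is available on Index objects.
idx_upper = pd.Index(['Alice', 'Bob', 'Eve']).str.upper()
```

Both approaches leave the missing entry missing, but the vectorized form needs no per-element guard.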
Methods Similar to Python String Methods
len | lower | translate | islower |
ljust | upper | startswith | isupper |
rjust | find | endswith | isnumeric |
center | rfind | isalnum | isdecimal |
zfill | index | isalpha | split |
strip | rindex | isdigit | rsplit |
rstrip | capitalize | isspace | partition |
lstrip | swapcase | istitle | rpartition |
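A short sketch of a few of these methods applied to the names Series from the earlier slides. Passing `na=False` to `startswith` is one way, assumed here for convenience, to get a clean Boolean mask when some values are missing.

```python
import pandas as pd

names = pd.Series(['Alice', 'Bob', None, 'Eve'])

# len returns a numeric Series; the missing entry stays missing.
lengths = names.str.len()

# startswith returns Booleans; na=False maps the missing entry to False
# so the result can be used directly as a filter mask.
starts_with_a = names.str.startswith('A', na=False)
a_names = names[starts_with_a]

# lower mirrors Python's str.lower, element by element.
lowered = names.str.lower()
```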
Methods Using Regular Expressions
Several methods accept regular expressions:
Method | Description |
---|---|
match | Call re.match on each element, returning a Boolean |
extract | Call re.search on each element, returning the groups of the first match as strings |
findall | Call re.findall on each element |
replace | Replace occurrences of pattern with some other string |
contains | Call re.search on each element, returning a Boolean |
count | Count occurrences of pattern |
split | Equivalent to str.split , but accepts regexps |
rsplit | Equivalent to str.rsplit (in current pandas the separator is treated as a literal string) |
Example: find all names that start and end with a consonant:
>>> names.str.findall(r'^[^AEIOU].*[^aeiou]$')
0       []
1    [Bob]
2     None
3       []
dtype: object
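The other regular-expression methods follow the same pattern. The sketch below applies contains, extract, count, and replace to the same names; `regex=True` is passed to replace explicitly because its default has differed between pandas versions.

```python
import pandas as pd

names = pd.Series(['Alice', 'Bob', None, 'Eve'])

# contains: Boolean per element; na=False maps the missing entry to False.
starts_with_vowel = names.str.contains(r'^[AEIOU]', na=False)

# extract: one DataFrame column per capture group (first match only).
parts = names.str.extract(r'^([A-Z])([a-z]+)$')

# count: number of lowercase vowels in each name.
vowel_counts = names.str.count(r'[aeiou]')

# replace: mask every lowercase vowel with an asterisk.
masked = names.str.replace(r'[aeiou]', '*', regex=True)
```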
Miscellaneous Methods
Method | Description |
---|---|
get | Index each element |
slice | Slice each element |
slice_replace | Replace slice in each element with passed value |
cat | Concatenate strings |
repeat | Repeat values |
normalize | Return Unicode form of string |
pad | Add whitespace to left, right, or both sides of string |
wrap | Split long strings into lines with length less than a given width |
join | Join strings in each element of the Series with passed separator |
get_dummies | Extract dummy variables as a dataframe |
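A brief sketch of four of these methods. The `letters` Series of lists is made-up data, introduced only to give join something to work on.

```python
import pandas as pd

names = pd.Series(['Alice', 'Bob', None, 'Eve'])

# pad: fill to a fixed width; side and fillchar are optional arguments.
padded = names.str.pad(7, side='both', fillchar='-')

# repeat: duplicate each string the given number of times.
doubled = names.str.repeat(2)

# cat: with no other Series given, concatenate all elements into one
# string; missing values are omitted unless na_rep is supplied.
all_names = names.str.cat(sep=', ')

# join: each element is itself a list of strings (hypothetical data).
letters = pd.Series([['a', 'b', 'c'], ['d', 'e']])
joined = letters.str.join('-')
```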
Vectorized Item Access and Slicing
Example: slice the first two characters:
>>> names.str.slice(0, 2)
0      Al
1      Bo
2    None
3      Ev
dtype: object
The same can be done with Python’s normal indexing syntax:
>>> names.str[0:2]
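Item access becomes more useful when chained after split, because each element is then a list. The sketch below pulls the last word out of each entry; the `full_names` data is hypothetical and not part of the original example.

```python
import pandas as pd

# Hypothetical data: first and last names, with one missing entry.
full_names = pd.Series(['Alice Smith', 'Bob Jones', None, 'Eve Adams'])

# Slicing works on the characters of each string.
first_two = full_names.str[0:2]

# split() with no argument splits on whitespace, giving a list per
# element; get(-1) then picks the last item of each list.
last_names = full_names.str.split().str.get(-1)

# Indexing syntax is equivalent to get.
last_names_alt = full_names.str.split().str[-1]
```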
Indicator Variables
The get_dummies method is useful when the data has a column containing some sort of coded indicator.
>>> df = pd.DataFrame({'name': names,
...                    'info': ['A|C|D', 'B|D', 'A|C', 'B|C|D']})
>>> df
    name   info
0  Alice  A|C|D
1    Bob    B|D
2   None    A|C
3    Eve  B|C|D
Example (continued)
>>> df['info'].str.get_dummies('|')
   A  B  C  D
0  1  0  1  1
1  0  1  0  1
2  1  0  1  0
3  0  1  1  1
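One possible follow-up, sketched here as an assumption about how the indicator columns might be used, is to join them back onto the original frame and then filter or count by code.

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['Alice', 'Bob', None, 'Eve'],
    'info': ['A|C|D', 'B|D', 'A|C', 'B|C|D'],
})

# One indicator column per distinct code found in 'info'.
dummies = df['info'].str.get_dummies('|')

# Attach the indicator columns to the original rows (index-aligned).
combined = df.join(dummies)

# Names of all rows carrying code 'C' (the missing name is kept as missing).
names_with_c = combined.loc[combined['C'] == 1, 'name']

# How many rows carry each code.
code_counts = dummies.sum()
```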