Pandas String Operations

CSC 223 - Advanced Scientific Programming

Motivation: NumPy generalizes arithmetic operations to perform the same operation on many array elements
```
>>> import numpy as np
>>> x = np.array([2, 3, 5, 7, 11, 13])
>>> x * 2
array([4, 6, 10, 14, 22, 26])
```

NumPy offers no comparably simple vectorized syntax for arrays of strings, so we fall back on more verbose loop syntax:
```
>>> data = ['Alice', 'Bob', 'Eve']
>>> [s.upper() for s in data]
['ALICE', 'BOB', 'EVE']
```

This breaks if there are missing values:
```
>>> data = ['Alice', 'Bob', None, 'Eve']
>>> [s.upper() for s in data]
Traceback (most recent call last):
  ...
AttributeError: 'NoneType' object has no attribute 'upper'
```
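Without Pandas, one way around the failure is to guard each element by hand. The following is a minimal sketch of that workaround, not a recommended pattern; the variable name `upper` is chosen only for illustration.

```python
data = ['Alice', 'Bob', None, 'Eve']

# Call upper() only on real strings and pass missing values through unchanged
upper = [s.upper() if s is not None else None for s in data]
print(upper)
```

The guard has to be repeated for every operation, which is the boilerplate that vectorized string methods remove.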

Pandas provides vectorized string operations that skip over missing data instead of raising an error. They are available through the `str` attribute of Series and Index objects:
```
>>> import pandas as pd
>>> names = pd.Series(data)
>>> names.str.upper()
0    ALICE
1      BOB
2     None
3      EVE
dtype: object
```
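As a small sketch of the same idea, the snippet below checks that the missing entry stays missing and shows that the `str` attribute also works on an Index. The `idx` object and its labels are made up for illustration.

```python
import pandas as pd

names = pd.Series(['Alice', 'Bob', None, 'Eve'])

# The missing entry is passed through rather than raising an error
upper = names.str.upper()
print(upper.isna().tolist())

# The same accessor is available on Index objects (hypothetical labels)
idx = pd.Index(['alice', 'bob', 'eve'])
capitalized = idx.str.capitalize()
print(capitalized.tolist())
```

How the missing entry is displayed (for example `None` or `NaN`) can depend on the Pandas version and the dtype of the Series, so `isna()` is the safer check.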

Methods Similar to Python String Methods

`len`	`lower`	`translate`	`islower`
`ljust`	`upper`	`startswith`	`isupper`
`rjust`	`find`	`endswith`	`isnumeric`
`center`	`rfind`	`isalnum`	`isdecimal`
`zfill`	`index`	`isalpha`	`split`
`strip`	`rindex`	`isdigit`	`rsplit`
`rstrip`	`capitalize`	`isspace`	`partition`
`lstrip`	`swapcase`	`istitle`	`rpartition`
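A few of the methods listed above, sketched on the same `names` data. The `na=False` argument to `startswith` is an optional choice made here so that the missing entry counts as "no match".

```python
import pandas as pd

names = pd.Series(['Alice', 'Bob', None, 'Eve'])

# Length of each string; the missing entry stays missing
lengths = names.str.len()

# Boolean test per element; na=False maps the missing entry to False
starts_with_a = names.str.startswith('A', na=False)

# Lowercase each string, mirroring Python's str.lower
lowered = names.str.lower()

print(lengths.tolist())
print(starts_with_a.tolist())
print(lowered.tolist())
```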

Methods Using Regular Expressions

There are several methods that accept regular expressions

Method	Description
`match`	Call `re.match` on each element, returning a Boolean
`extract`	Call `re.search` on each element, returning the captured groups of the first match as strings
`findall`	Call `re.findall` on each element
`replace`	Replace occurrences of pattern with some other string (pass `regex=True` to treat the pattern as a regexp in pandas 2.0 and later)
`contains`	Call `re.search` on each element, returning a Boolean
`count`	Count occurrences of pattern
`split`	Equivalent to `str.split`, but accepts regexps
`rsplit`	Equivalent to `str.rsplit`, splitting from the right (unlike `split`, current pandas treats the delimiter as a literal string here)
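The sketch below exercises a few of these methods on the `names` data. The patterns are illustrative choices, and `regex=True` is passed to `replace` explicitly because recent Pandas versions treat the pattern as a literal string by default.

```python
import pandas as pd

names = pd.Series(['Alice', 'Bob', None, 'Eve'])

# contains: does the name start with A or E? Missing entries count as False
starts_a_or_e = names.str.contains(r'^[AE]', na=False)

# count: number of lowercase vowels in each name
vowel_counts = names.str.count(r'[aeiou]')

# extract: capture the leading uppercase letter as a Series of strings
initials = names.str.extract(r'^([A-Z])', expand=False)

# replace: mask lowercase vowels, treating the pattern as a regexp
masked = names.str.replace(r'[aeiou]', '*', regex=True)

print(starts_a_or_e.tolist())
print(vowel_counts.tolist())
print(initials.tolist())
print(masked.tolist())
```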

Example: find all names that start and end with a consonant
```
>>> names.str.findall(r'^[^AEIOU].*[^aeiou]$')
0       []
1    [Bob]
2     None
3       []
dtype: object
```

Miscellaneous Methods

Method	Description
`get`	Index each element
`slice`	Slice each element
`slice_replace`	Replace slice in each element with passed value
`cat`	Concatenate strings
`repeat`	Repeat values
`normalize`	Return Unicode form of string
`pad`	Add whitespace to left, right, or both sides of string
`wrap`	Split long strings into lines with length less than a given width
`join`	Join strings in each element of the Series with passed separator
`get_dummies`	Extract dummy variables as a DataFrame
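A short sketch of four of these methods. The data here omits the missing value to keep the results easy to read, and the width, fill character, and separator are arbitrary choices.

```python
import pandas as pd

names = pd.Series(['Alice', 'Bob', 'Eve'])

# get: the character at position 0 of each element
first_chars = names.str.get(0)

# cat: with no other Series given, concatenate all elements into one string
joined = names.str.cat(sep=', ')

# pad: pad on the left to a width of 7 using '.' as the fill character
padded = names.str.pad(7, side='left', fillchar='.')

# repeat: repeat each string twice
doubled = names.str.repeat(2)

print(first_chars.tolist())
print(joined)
print(padded.tolist())
print(doubled.tolist())
```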

Vectorized Item Access and Slicing

Example: slice the first two characters
```
>>> names.str.slice(0, 2)
0      Al
1      Bo
2    None
3      Ev
dtype: object
```

This can also be done with Python's normal indexing syntax:
```
>>> names.str[0:2]
```
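Item access combines naturally with `split`. The sketch below pulls the last word out of each entry; the `full_names` Series and its surnames are hypothetical data invented for this example.

```python
import pandas as pd

# Hypothetical full names, including one missing value
full_names = pd.Series(['Alice Smith', 'Bob Jones', None, 'Eve Adams'])

# Split on whitespace, then take the last piece of each resulting list
last_names = full_names.str.split().str.get(-1)

# Indexing syntax works for single items as well as slices
first_letters = full_names.str[0]

print(last_names.tolist())
print(first_letters.tolist())
```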

Indicator Variables

The `get_dummies` method is useful when a column contains some sort of coded indicator, such as several codes joined by `|`:
```
>>> df = pd.DataFrame(
...     {'name': names,
...      'info': ['A|C|D', 'B|D', 'A|C', 'B|C|D']
...     })
>>> df
    name   info
0  Alice  A|C|D
1    Bob    B|D
2   None    A|C
3    Eve  B|C|D
```

Example (continued)

```
>>> df['info'].str.get_dummies('|')
   A  B  C  D
0  1  0  1  1
1  0  1  0  1
2  1  0  1  0
3  0  1  1  1
```
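One possible next step, sketched below, is to join the indicator columns back onto the original frame and use them as filters. The names `dummies`, `combined`, and `with_c` are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['Alice', 'Bob', None, 'Eve'],
    'info': ['A|C|D', 'B|D', 'A|C', 'B|C|D'],
})

# One 0/1 column per distinct code found in the 'info' column
dummies = df['info'].str.get_dummies('|')

# Attach the indicator columns to the original rows (aligned on the index)
combined = df.join(dummies)

# Select the rows whose 'info' field includes code C
with_c = combined[combined['C'] == 1]

print(combined)
print(with_c.index.tolist())
```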