Pandas String Operations

CSC 223 - Advanced Scientific Programming

  • Motivation: NumPy generalizes arithmetic operations to perform the same operation on many array elements

    >>> import numpy as np
    >>> x = np.array([2, 3, 5, 7, 11, 13])
    >>> x * 2
    array([4, 6, 10, 14, 22, 26])

  • NumPy does not provide comparably simple vectorized operations for arrays of strings, so we fall back on more verbose loop syntax

    >>> data = ['Alice', 'Bob', 'Eve']
    >>> [s.upper() for s in data]
    ['ALICE', 'BOB', 'EVE']
  • This breaks if the data contains missing values, because None has no upper method

    >>> data = ['Alice', 'Bob', None, 'Eve']
    >>> [s.upper() for s in data]
    Traceback (most recent call last):
      ...
    AttributeError: 'NoneType' object has no attribute 'upper'
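
  • Without pandas, one workaround is to guard each element by hand. The following is a minimal sketch of that idea (the variable name upper is only illustrative); it shows the bookkeeping that a vectorized string method would otherwise take care of.

```python
data = ['Alice', 'Bob', None, 'Eve']

# Guard each element so that None passes through unchanged
# instead of triggering an AttributeError.
upper = [s.upper() if s is not None else None for s in data]
print(upper)
```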

  • Pandas provides vectorized string operations through the str attribute of Series and Index objects; these operations pass missing values through instead of raising an error

    >>> import pandas as pd
    >>> names = pd.Series(data)
    >>> names.str.upper()
    0    ALICE
    1      BOB
    2     None
    3      EVE
    dtype: object
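
  • The same accessor works for Boolean tests and for Index objects. The sketch below assumes the names Series from above; the lowercase labels Index is made up for illustration. Passing na=False is one way to treat a missing entry as "no match" so the result is a clean Boolean mask.

```python
import pandas as pd

names = pd.Series(['Alice', 'Bob', None, 'Eve'])

# na=False makes the missing entry count as "no match",
# so the result contains only True/False values.
starts_with_a = names.str.startswith('A', na=False)
print(starts_with_a.tolist())

# Index objects expose the same .str accessor.
labels = pd.Index(['alice', 'bob', 'eve'])   # hypothetical labels
upper_labels = labels.str.upper()
print(list(upper_labels))
```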

Methods Similar to Python String Methods

    len      lower       translate   islower
    ljust    upper       startswith  isupper
    rjust    find        endswith    isnumeric
    center   rfind       isalnum     isdecimal
    zfill    index       isalpha     split
    strip    rindex      isdigit     rsplit
    rstrip   capitalize  isspace     partition
    lstrip   swapcase    istitle     rpartition
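
  • A short sketch of three of these methods applied to the names Series used earlier (the result variable names are illustrative). Each call mirrors the Python string method of the same name and leaves the missing entry missing.

```python
import pandas as pd

names = pd.Series(['Alice', 'Bob', None, 'Eve'])

lowered = names.str.lower()          # same as str.lower on each element
padded = names.str.ljust(6, '.')     # left-justify to width 6, fill with '.'
pos_e = names.str.find('e')          # index of first 'e', or -1 if absent

# dropna() removes the missing row so the remaining values are easy to read.
print(lowered.dropna().tolist())
print(padded.dropna().tolist())
print(pos_e.dropna().tolist())
```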

Methods Using Regular Expressions

There are several methods that accept regular expressions

    Method    Description
    match     Call re.match on each element, returning a Boolean
    extract   Call re.search on each element, returning the groups of the first match as strings
    findall   Call re.findall on each element
    replace   Replace occurrences of pattern with some other string (in recent pandas versions, pass regex=True to treat the pattern as a regular expression)
    contains  Call re.search on each element, returning a Boolean
    count     Count occurrences of pattern
    split     Equivalent to str.split, but accepts regexps
    rsplit    Equivalent to str.rsplit; unlike split, it takes a plain string separator rather than a regexp
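
  • A hedged sketch of three of these methods on the names Series from earlier. The patterns and result names are chosen here purely for illustration; regex=True is passed explicitly to replace so the pattern is treated as a regular expression.

```python
import pandas as pd

names = pd.Series(['Alice', 'Bob', None, 'Eve'])

# extract: one capture group; expand=False returns a Series, not a DataFrame.
first_letter = names.str.extract(r'^([A-Za-z])', expand=False)

# count: number of vowels (either case) in each name.
vowel_counts = names.str.count(r'[aeiouAEIOU]')

# replace: delete lowercase vowels.
no_vowels = names.str.replace(r'[aeiou]', '', regex=True)

print(first_letter.dropna().tolist())
print(vowel_counts.dropna().tolist())
print(no_vowels.dropna().tolist())
```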

Methods Using Regular Expressions

  • Example: find all names that start and end with a consonant

    >>> names.str.findall(r'^[^AEIOU].*[^aeiou]$')
    0       []
    1    [Bob]
    2     None
    3       []
    dtype: object
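
  • To keep only the matching names rather than a list per row, the same pattern can be used as a Boolean mask. This is a sketch, assuming the names are capitalized as in the example (the pattern excludes uppercase vowels at the start and lowercase vowels at the end); na=False treats the missing entry as a non-match.

```python
import pandas as pd

names = pd.Series(['Alice', 'Bob', None, 'Eve'])

# Same pattern as in the findall example.
pattern = r'^[^AEIOU].*[^aeiou]$'

# contains applies re.search to each element; na=False maps missing to False.
mask = names.str.contains(pattern, na=False)
consonant_names = names[mask]

print(mask.tolist())
print(consonant_names.tolist())
```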

Miscellaneous Methods

    Method         Description
    get            Index each element
    slice          Slice each element
    slice_replace  Replace slice in each element with passed value
    cat            Concatenate strings
    repeat         Repeat values
    normalize      Return Unicode normal form of string
    pad            Add whitespace to left, right, or both sides of string
    wrap           Split long strings into lines with length less than a given width
    join           Join strings in each element of the Series with passed separator
    get_dummies    Extract dummy variables as a DataFrame
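
  • A brief sketch of a few of these methods on the names Series (result names and fill characters are illustrative). Note that cat with no other argument collapses the whole Series into a single string; na_rep supplies a placeholder for the missing entry.

```python
import pandas as pd

names = pd.Series(['Alice', 'Bob', None, 'Eve'])

last_char = names.str.get(-1)                  # last character of each name
doubled = names.str.repeat(2)                  # each name written twice
padded = names.str.pad(7, side='left', fillchar='*')
joined = names.str.cat(sep=', ', na_rep='?')   # one string for the whole Series

print(last_char.dropna().tolist())
print(doubled.dropna().tolist())
print(padded.dropna().tolist())
print(joined)
```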

Vectorized Item Access and Slicing

  • Example: slice the first two characters

    >>> names.str.slice(0,2)
    0      Al
    1      Bo
    2    None
    3      Ev
    dtype: object
  • Can also be done with Python’s normal indexing syntax

    >>> names.str[0:2]
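
  • Item access combines naturally with split. The sketch below uses a hypothetical Series of full names (not part of the running example) to pull out the last name and the first initial of each entry.

```python
import pandas as pd

# Hypothetical data for illustration only.
full_names = pd.Series(['Alice Smith', 'Bob Jones', None, 'Eve Adams'])

# split() yields a list per row; get(-1) then picks the last list item.
last_names = full_names.str.split().str.get(-1)

# Indexing with a single position returns one character per row.
initials = full_names.str[0]

print(last_names.dropna().tolist())
print(initials.dropna().tolist())
```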

Indicator Variables

  • The get_dummies method is useful when the data has a column containing some sort of coded indicator.

    >>> df = pd.DataFrame(
    ...     {'name': names,
    ...      'info': ['A|C|D', 'B|D', 'A|C', 'B|C|D']
    ...     })
    >>> df
        name   info
    0  Alice  A|C|D
    1    Bob    B|D
    2   None    A|C
    3    Eve  B|C|D

Indicator Variables

  • Example (continued)

    >>> df['info'].str.get_dummies('|')
       A  B  C  D
    0  1  0  1  1
    1  0  1  0  1
    2  1  0  1  0
    3  0  1  1  1
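
  • The indicator columns are usually most useful alongside the original data. The following sketch, which rebuilds the same df so that it runs on its own, joins the dummy columns back onto the frame and filters on one of them; the names combined and has_c are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['Alice', 'Bob', None, 'Eve'],
    'info': ['A|C|D', 'B|D', 'A|C', 'B|C|D'],
})

# One 0/1 column per code found in the '|'-separated strings.
dummies = df['info'].str.get_dummies('|')

# join aligns on the row index, so each row keeps its own indicators.
combined = df.join(dummies)

# Rows whose info string contains the code 'C'.
has_c = combined[combined['C'] == 1]

print(list(combined.columns))
print(has_c.index.tolist())
```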