Python String Manipulation and Regular Expressions

CSC 223 - Advanced Scientific Programming

Formatting Strings: Adjusting Case

  • upper
  • lower
  • capitalize
  • title
  • swapcase

Formatting Strings: Adjusting Case

  • Examples

    >>> fox = "tHe qUICk bROWn fOx"
    >>> fox.upper()
    'THE QUICK BROWN FOX'
    >>> fox.title()
    'The Quick Brown Fox'
    >>> fox.swapcase()
    'ThE QuicK BrowN FoX'
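
  • The two methods from the list that the examples above do not cover, lower and capitalize, can be sketched the same way (the variable names are illustrative only):

```python
fox = "tHe qUICk bROWn fOx"

# lower() converts every cased character to lower case
lowered = fox.lower()            # 'the quick brown fox'

# capitalize() upper-cases the first character and lower-cases the rest
capitalized = fox.capitalize()   # 'The quick brown fox'

print(lowered)
print(capitalized)
```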

Formatting Strings: Adding and Removing Spaces

  • strip
  • rstrip
  • lstrip
  • center
  • ljust
  • rjust

Formatting Strings: Removing Spaces

  • Example:

    >>> line = '   stuff   '
    >>> line.strip()
    'stuff'
  • The strip methods accept an optional argument naming the characters to remove from the ends

    >>> num = "000000000042"
    >>> num.strip('0')
    '42'
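
  • A short sketch of the one-sided variants, lstrip and rstrip. It also shows that the argument is treated as a set of characters, not as a prefix or suffix (the sample strings are made up for illustration):

```python
line = '   stuff   '

# lstrip() removes only leading whitespace, rstrip() only trailing whitespace
left_stripped = line.lstrip()     # 'stuff   '
right_stripped = line.rstrip()    # '   stuff'

# Any mix of 'x' and 'y' is removed from both ends, in any order
trimmed = 'xyxhelloyx'.strip('xy')   # 'hello'

print(repr(left_stripped), repr(right_stripped), repr(trimmed))
```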

Formatting Strings: Adding Spaces

  • Examples:

    >>> line = 'stuff'
    >>> line.center(20)
    '       stuff        '
    >>> line.ljust(20)
    'stuff               '
    >>> '42'.rjust(10, '0')
    '0000000042'

Finding and Replacing Substrings

  • find
  • rfind
  • index
  • rindex
  • endswith
  • startswith
  • replace

Finding and Replacing Substrings

  • Examples:

    >>> line = 'Hello world'
    >>> line.find('world')
    6
    >>> line.rfind('o')
    7
    >>> line.startswith('He')
    True
    >>> line.replace('o', '---')
    'Hell--- w---rld'
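
  • find and index differ only in how they report a missing substring. A minimal sketch (the search string 'xyz' is arbitrary):

```python
line = 'Hello world'

# find() returns -1 when the substring is absent
missing = line.find('xyz')       # -1

# index() raises ValueError in the same situation
try:
    line.index('xyz')
except ValueError:
    index_raised = True
else:
    index_raised = False

# rindex() and endswith() mirror rfind() and startswith()
last_o = line.rindex('o')        # 7
ends = line.endswith('world')    # True

print(missing, index_raised, last_o, ends)
```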

Splitting and Partitioning Strings

  • partition
  • rpartition
  • split
  • splitlines
  • join

Splitting and Partitioning Strings

  • Examples:

    >>> line = 'The quick brown fox jumped over a lazy dog'
    >>> line.partition('fox')
    ('The quick brown ', 'fox', ' jumped over a lazy dog')
    >>> line.split()
    ['The', 'quick', 'brown', 'fox', 'jumped', 'over', 'a', 'lazy', 'dog']
    >>> '--'.join(['1', '2', '3'])
    '1--2--3'
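
  • The remaining methods, rpartition and splitlines, might be used as follows (the file name and text are invented for this sketch):

```python
path = 'archive.tar.gz'

# partition() splits at the first separator, rpartition() at the last
head = path.partition('.')     # ('archive', '.', 'tar.gz')
tail = path.rpartition('.')    # ('archive.tar', '.', 'gz')

# splitlines() splits on line boundaries and drops the line endings
text = 'first line\nsecond line\n'
lines = text.splitlines()      # ['first line', 'second line']

print(head, tail, lines)
```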

format Basics

  • The format method requires that the string have format fields indicated with curly braces.

  • The format fields are replaced with objects passed into the format method.

  • Example:

    >>> '{} and {}'.format('Alice', 'Bob')
    'Alice and Bob'

format Arguments

  • The format fields can be replaced by position or keyword

  • Position example:

    >>> '{1} and {0}'.format('Alice', 'Bob')
    'Bob and Alice'
  • Keyword example:

    >>> '{a} and {b}'.format(a='Alice', b='Bob')
    'Alice and Bob'
  • Mixed example:

    >>> '{b} and {0}'.format('Alice', b='Bob')
    'Bob and Alice'

Formatting Example with format

>>> for x in range(1,11):
...     print('{:2} {:3} {:4}'.format(x,x**2,x**3))
...
 1   1    1
 2   4    8
 3   9   27
 4  16   64
 5  25  125
 6  36  216
 7  49  343
 8  64  512
 9  81  729
10 100 1000

format Specifiers

  • An optional ’:’ and format specifier can follow the format field name

  • Format specifiers enable greater control over formatted values

  • Syntax (all are optional): {:[1][2][3][4][5][6][7][8]}

    1. fill and alignment

    2. sign

    3. ’#’ – alternate number form

    4. ’0’ – sign aware zero padding

    5. width

    6. grouping

    7. ‘.’ followed by integer – precision

    8. type
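
  • As a rough illustration of how the pieces combine, the sketch below uses six of the eight options in one specifier (the value 1234.5 is arbitrary):

```python
# '*'  fill character      '>'   right align
# '+'  always show sign    '12'  minimum width
# ','  thousands grouping  '.2'  precision
# 'f'  fixed point type
formatted = '{:*>+12,.2f}'.format(1234.5)

print(formatted)   # ***+1,234.50
```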

format Specifiers: Width Option

  • The width option is a positive integer that specifies the minimum field width

  • If the value is longer than the width, it is printed in full and never truncated

  • Example:

    >>> '{:10}'.format('hello')
    'hello     '
    >>> '{:2}'.format('hello')
    'hello'
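
  • When only a width is given, the padding side depends on the type of the value: strings are left aligned by default and numbers are right aligned. A small sketch:

```python
# Strings are padded on the right (left aligned) by default
text_field = '{:6}'.format('ab')     # 'ab    '

# Numbers are padded on the left (right aligned) by default
number_field = '{:6}'.format(42)     # '    42'

print(repr(text_field), repr(number_field))
```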

format Specifiers: Fill and Alignment

  • Alignment options:

    • ’<’: left align

    • ’>’: right align

    • ’=’: pad after sign (if any) but before digits

    • ’^’: center align

  • A fill character can optionally be specified before the alignment character

  • Example:

    >>> '{:^11}'.format('hello')
    '   hello   '
    >>> '{:>11}'.format('hello')
    '      hello'
    >>> '{:*>11}'.format('hello')
    '******hello'
    >>> '{:-^11}'.format('hello')
    '---hello---'
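
  • The '=' alignment is not shown in the examples above. A sketch of how it places padding between the sign and the digits (the values are arbitrary):

```python
# Default fill (space), '=' alignment, explicit '+' sign, width 8
padded_plus = '{:=+8}'.format(42)     # '+     42'

# Custom fill character '*' with '=' alignment
padded_star = '{:*=8}'.format(-42)    # '-*****42'

print(repr(padded_plus), repr(padded_star))
```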

format Specifiers: Sign option

  • The sign option is only valid for number types

    • ’+’: sign should be used for both positive and negative numbers

    • ’-’: sign should be used for only negative numbers (default)

    • ’ ’ (space): leading space for positive numbers and minus sign for negative numbers

  • Examples

    >>> '{:+}'.format(123)
    '+123'
    >>> '{: }'.format(123)
    ' 123'
    >>> '{:+}'.format(1.414)
    '+1.414'

format Specifiers: #

  • The # option causes a type specific “alternate form” to be used for the conversion

  • The # can only be used for integer, float, complex, and Decimal types

  • Example

    >>> # print 123 in binary
    >>> '{:b}'.format(123)
    '1111011'
    >>> '{:#b}'.format(123)
    '0b1111011'
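
  • The same option adds a prefix for the octal and hexadecimal types. A brief sketch:

```python
octal = '{:#o}'.format(8)          # '0o10'
hex_lower = '{:#x}'.format(255)    # '0xff'
hex_upper = '{:#X}'.format(255)    # '0XFF'

print(octal, hex_lower, hex_upper)
```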

format Specifiers: 0

  • The 0 (preceding the width option) enables sign aware zero-padding for numeric types

  • Example

    >>> '{:010}'.format(123)
    '0000000123'
    >>> '{:010}'.format(1.414)
    '000001.414'
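
  • "Sign aware" means the zeros are placed after the sign. The sketch below contrasts the 0 option with an ordinary '0' fill and right alignment, which puts the zeros before the sign:

```python
# 0 option: zeros go between the sign and the digits
sign_aware = '{:010}'.format(-123)     # '-000000123'

# Plain fill character '0' with '>' alignment: zeros precede the sign
plain_fill = '{:0>10}'.format(-123)    # '000000-123'

print(sign_aware, plain_fill)
```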

format Specifiers: Grouping Option

  • The grouping option specifies a character for separating thousands in numbers

  • The grouping option can be either ’_’ or ’,’

  • Example

    >>> '{:_}'.format(1000000)
    '1_000_000'
    >>> '{:,}'.format(1000000)
    '1,000,000'
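
  • Grouping can be combined with a precision and a type. With the hexadecimal type the underscore separates groups of four digits instead of three. A sketch with arbitrary values:

```python
# Comma grouping together with two digits of fixed point precision
money = '{:,.2f}'.format(1234567.891)     # '1,234,567.89'

# Underscore grouping with the hex type groups every four digits
hex_grouped = '{:_x}'.format(0xDEADBEEF)  # 'dead_beef'

print(money, hex_grouped)
```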

format Specifiers: Precision Option

  • The precision option specifies how many digits are displayed:

    • after the decimal point for fixed point floating point values

    • in total (significant digits) for general floating point values

  • Example

    >>> import math
    >>> x = math.sqrt(2)
    >>> '{:.2}'.format(x)
    '1.4'
    >>> '{:.2f}'.format(x)
    '1.41'
    >>> '{:.6}'.format(x)
    '1.41421'
    >>> '{:.6f}'.format(x)
    '1.414214'
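
  • Precision also applies to strings, where it gives the maximum number of characters, and it is rejected for integers. A minimal sketch:

```python
# For strings, precision truncates to at most that many characters
short = '{:.3}'.format('hello')        # 'hel'

# Width and precision together: truncate, then pad to the width
padded = '{:10.3}'.format('hello')     # 'hel       '

# Precision is not allowed with the integer type 'd'
try:
    '{:.2d}'.format(5)
except ValueError:
    int_precision_rejected = True
else:
    int_precision_rejected = False

print(repr(short), repr(padded), int_precision_rejected)
```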

format Specifiers: Type Option

  • String option:

    • ’s’: string
  • Common integer options:

    • ’b’: binary

    • ’c’: character

    • ’d’: decimal

    • ’o’: octal

    • ’x’: hex (lower case)

    • ’X’: hex (upper case)

format Specifiers: Type Option (Continued)

  • Common float options:

    • ’e’: exponent notation

    • ’E’: exponent notation (upper case E)

    • ’f’: fixed point

    • ’F’: fixed point (upper case NAN and INF)

    • ’g’: general format

    • ’%’: percent (multiply by 100 and display in fixed point with a percent sign)

format Specifiers: Type Option

  • Integer examples

    >>> '{:d}'.format(123)
    '123'
    >>> '{:b}'.format(123)
    '1111011'
    >>> '{:X}'.format(123)
    '7B'
  • Floating point examples

    >>> '{:f}'.format(1.414)
    '1.414000'
    >>> '{:E}'.format(1.414)
    '1.414000E+00'
    >>> '{:%}'.format(1.414)
    '141.400000%'
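
  • The type options not exercised above can be sketched in the same way (all values are arbitrary):

```python
# Integer types
char = '{:c}'.format(65)              # 'A'  (code point 65)
octal = '{:o}'.format(64)             # '100'

# Float types
sci = '{:e}'.format(12345.678)        # '1.234568e+04'
general = '{:g}'.format(12345.678)    # '12345.7' (six significant digits)
tiny = '{:g}'.format(0.000012345)     # '1.2345e-05' (switches to exponent form)

print(char, octal, sci, general, tiny)
```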

Pattern Matching with Regular Expressions

  • Regular expressions are a means of flexible pattern matching in strings

  • The Python interface to regular expressions is the re module

  • Some re functions:
    • compile
    • split
    • match
    • search
    • sub

Pattern Matching with Regular Expressions

  • Example

    >>> import re
    >>> regex = re.compile(r'\s+')
    >>> line = 'the quick brown fox jumped over a lazy dog'
    >>> regex.split(line)
    ['the', 'quick', 'brown', 'fox', 'jumped', 'over', 'a', 'lazy', 'dog']
  • The pattern \s+ matches one or more consecutive whitespace characters

Pattern Matching with Regular Expressions

  • Example: comparing string methods with their regular expression equivalents

    >>> line.index('fox')
    16
    >>> regex = re.compile('fox')
    >>> match = regex.search(line)
    >>> match.start()
    16
    >>> line.replace('fox', 'BEAR')
    'the quick brown BEAR jumped over a lazy dog'
    >>> regex.sub('BEAR', line)
    'the quick brown BEAR jumped over a lazy dog'
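
  • match, listed earlier but not shown, only succeeds at the beginning of the string, while search scans the whole string. The sketch below also shows a substitution that a plain replace cannot express (the sample strings are made up):

```python
import re

line = 'the quick brown fox jumped over a lazy dog'
regex = re.compile(r'fox')

# match() anchors at position 0, so it fails here
at_start = regex.match(line)        # None

# search() finds the first occurrence anywhere in the string
anywhere = regex.search(line)
span = anywhere.span()              # (16, 19)

# Collapse any run of whitespace into a single space
collapsed = re.sub(r'\s+', ' ', 'too    many\t spaces')

print(at_start, span, collapsed)
```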

Basic Regular Expression Syntax

  • Simple strings are matched exactly

    >>> regex = re.compile('ion')
    >>> regex.findall('Great Expectations')
    ['ion']
  • Some characters have special meanings:

    . ^ $ * + ? { } [ ] \ | ( )

    • These characters need to be escaped with a backslash (\) to match them
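
  • When the text to match is not known in advance, re.escape can add the backslashes automatically. A small sketch, with an invented price string:

```python
import re

literal = '$20.00'

# Unescaped, '$' anchors at the end of the string, so nothing matches
unescaped = re.compile(literal).findall('pay $20.00, not $20x00')

# re.escape() backslash-escapes the special characters '$' and '.'
escaped = re.compile(re.escape(literal)).findall('pay $20.00, not $20x00')

print(unescaped)   # []
print(escaped)     # ['$20.00']
```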

Regular Expression Escape Characters

  • Example: the ‘r’ prefix indicates a raw string

    >>> regex = re.compile(r'\$')
    >>> regex.findall("the cost is $20")
    ['$']
  • Example: strings versus raw strings

    >>> print('a\tb\tc')
    a    b    c
    >>> print(r'a\tb\tc')
    a\tb\tc
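
  • The difference is visible in the string lengths: in a raw string the backslash stays a separate character. A minimal sketch:

```python
plain = 'a\tb'     # backslash-t is a single tab character
raw = r'a\tb'      # backslash and 't' are two separate characters

plain_len = len(plain)    # 3
raw_len = len(raw)        # 4

# A raw string is equivalent to doubling every backslash
same = (r'\d' == '\\d')   # True

print(plain_len, raw_len, same)
```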

Matching Character Groups

  • The backslash can be used to give normal characters special meaning

    character   description
    "\d"        match any digit
    "\D"        match any non-digit
    "\s"        match any whitespace
    "\S"        match any non-whitespace
    "\w"        match any alphanumeric char or underscore
    "\W"        match any char that is not alphanumeric or underscore
  • Example

    >>> regex = re.compile(r'\w\s\w')
    >>> regex.findall('the fox is 9 years old')
    ['e f', 'x i', 's 9', 's o']
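
  • A few more of the groups from the table, sketched on made-up strings. Note that \w also matches the underscore:

```python
import re

# \d matches a single digit
digits = re.findall(r'\d', 'the fox is 9 years old')    # ['9']

# \D+ matches runs of non-digit characters
non_digits = re.findall(r'\D+', 'a1b22c')               # ['a', 'b', 'c']

# \w+ treats the underscore as a word character
words = re.findall(r'\w+', 'snake_case name')           # ['snake_case', 'name']

print(digits, non_digits, words)
```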

Matching Custom Character Groups

  • The square brackets specify a set of characters

    >>> regex = re.compile('[aeiou]')
    >>> regex.split('consequential')
    ['c', 'ns', 'q', '', 'nt', '', 'l']
  • The dash can be used to specify a range

    >>> regex = re.compile('[A-Z][0-9]')
    >>> regex.findall('1043879, G2, H6')
    ['G2', 'H6']
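
  • A caret as the first character inside the brackets negates the set. A short sketch that keeps only the consonants of an arbitrary phrase:

```python
import re

# [^aeiou\s] matches any character that is neither a vowel nor whitespace
consonants = re.findall(r'[^aeiou\s]', 'a lazy dog')

print(consonants)   # ['l', 'z', 'y', 'd', 'g']
```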

Matching Repeated Characters

  • Curly braces with a number specify repetition

    >>> regex = re.compile(r'\w{3}')
    >>> regex.findall('The quick brown fox')
    ['The', 'qui', 'bro', 'fox']

Matching Repeated Characters

    character   description
    ?           match zero or one repetitions
    *           match zero or more repetitions
    +           match one or more repetitions
    {n}         match exactly n repetitions
    {m,n}       match between m and n repetitions
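
  • Each repetition operator from the table can be sketched on a small, made-up input:

```python
import re

# ? : the 'u' is optional
optional = re.findall(r'colou?r', 'color colour')     # ['color', 'colour']

# * : 'a' followed by zero or more 'b'
zero_plus = re.findall(r'ab*', 'a ab abbb')           # ['a', 'ab', 'abbb']

# + : 'a' followed by at least one 'b'
one_plus = re.findall(r'ab+', 'a ab abbb')            # ['ab', 'abbb']

# {m,n} : two or three digits, taking as many as possible
ranged = re.findall(r'\d{2,3}', '1 22 333 4444')      # ['22', '333', '444']

print(optional, zero_plus, one_plus, ranged)
```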

Matching Repeated Characters

  • Example: email address matcher

    email = re.compile(r'[\w.]+@\w+\.[a-z]{3}')
    • "[\w.]+" one or more alphanumeric characters or periods
    • "@" at sign
    • "\w+" one or more alphanumeric characters
    • "\." period
    • "[a-z]{3}" exactly three lower case characters
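
  • A sketch of the matcher in use. The addresses are invented for illustration. Because the pattern requires exactly three lower case letters after the final period, it is only a teaching example and rejects valid addresses such as those with two-letter domains:

```python
import re

email = re.compile(r'[\w.]+@\w+\.[a-z]{3}')

text = 'Write to guido@python.org or to guido@google.com.'

# findall() returns every non-overlapping match in the text
found = email.findall(text)

# A two-letter top-level domain does not satisfy [a-z]{3}
two_letter = email.fullmatch('user@example.io')

print(found)        # ['guido@python.org', 'guido@google.com']
print(two_letter)   # None
```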