Python String Manipulation and Regular Expressions

CSC 223 - Advanced Scientific Programming

Formatting Strings: Adjusting Case

upper
lower
capitalize
title
swapcase

Formatting Strings: Adjusting Case

Examples

>>> fox = 'tHe qUICk bROWn fOx'
>>> fox.upper()
'THE QUICK BROWN FOX'
>>> fox.title()
'The Quick Brown Fox'
>>> fox.swapcase()
'ThE QuicK BrowN FoX'
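
The remaining two methods from the list, lower and capitalize, behave along the same lines. A minimal sketch, reusing the sample string from the examples (the names lowered and capitalized are illustrative only):

```python
fox = 'tHe qUICk bROWn fOx'

# lower() converts every cased character to lowercase
lowered = fox.lower()

# capitalize() uppercases the first character and lowercases the rest
capitalized = fox.capitalize()

print(lowered)      # the quick brown fox
print(capitalized)  # The quick brown fox
```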

Formatting Strings: Adding and Removing Spaces

strip
rstrip
lstrip
center
ljust
rjust

Formatting Strings: Removing Spaces

Example:

>>> line = '   stuff   '
>>> line.strip()
'stuff'

The strip methods can remove arbitrary characters

>>> num = "000000000042"
>>> num.strip('0')
'42'
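
The one-sided variants lstrip and rstrip can be sketched the same way. The string messy below is a made-up value, chosen to show that the argument is treated as a set of characters rather than as a literal prefix or suffix:

```python
line = '   stuff   '

# lstrip removes from the left only, rstrip from the right only
left_stripped = line.lstrip()
right_stripped = line.rstrip()

# Removing only leading zeros
num = '000000000042'
no_leading_zeros = num.lstrip('0')

# 'xy' means "any mix of x and y", not the exact substring 'xy'
messy = 'xyxhelloyx'
cleaned = messy.strip('xy')

print(repr(left_stripped))   # 'stuff   '
print(repr(right_stripped))  # '   stuff'
print(no_leading_zeros)      # 42
print(cleaned)               # hello
```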

Formatting Strings: Adding Spaces

Examples:

>>> line = 'stuff'
>>> line.center(20)
'       stuff        '
>>> line.ljust(20)
'stuff               '
>>> '42'.rjust(10, '0')
'0000000042'
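
All three padding methods accept an optional fill character, and none of them truncates. A short sketch under those assumptions (the widths and fill characters are arbitrary choices for illustration):

```python
line = 'stuff'

# center, ljust, and rjust all accept an optional fill character
centered = line.center(11, '*')
left = line.ljust(8, '.')
right = line.rjust(8, '.')

# A width smaller than the string length returns the string unchanged
unchanged = line.rjust(3)

print(centered)   # ***stuff***
print(left)       # stuff...
print(right)      # ...stuff
print(unchanged)  # stuff
```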

Finding and Replacing Substrings

find
rfind
index
rindex
endswith
startswith
replace

Finding and Replacing Substrings

Examples:

>>> line = 'Hello world'
>>> line.find('world')
6
>>> line.rfind('o')
7
>>> line.startswith('He')
True
>>> line.replace('o', '---')
'Hell--- w---rld'
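
The practical difference between find and index shows up when the substring is missing: find returns -1, while index raises ValueError. A small sketch of that behavior, along with rindex and endswith (the search string 'xyz' is just a value known to be absent):

```python
line = 'Hello world'

# find returns -1 when the substring is absent
missing = line.find('xyz')

# index raises ValueError in the same situation
try:
    line.index('xyz')
    index_raised = False
except ValueError:
    index_raised = True

# rindex works like rfind but raises on failure; endswith mirrors startswith
last_l = line.rindex('l')
ends = line.endswith('world')

print(missing)       # -1
print(index_raised)  # True
print(last_l)        # 9
print(ends)          # True
```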

Splitting and Partitioning Strings

partition
rpartition
split
splitlines
join

Splitting and Partitioning Strings

Examples:

>>> line = 'The quick brown fox jumped over a lazy dog'
>>> line.partition('fox')
('The quick brown ', 'fox', ' jumped over a lazy dog')
>>> line.split()
['The', 'quick', 'brown', 'fox', 'jumped', 'over', 'a', 'lazy', 'dog']
>>> '--'.join(['1', '2', '3'])
'1--2--3'
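
The methods not shown above can be sketched as follows. The multi-line string and the comma-separated string are made-up inputs for illustration:

```python
line = 'The quick brown fox jumped over a lazy dog'

# rpartition splits around the LAST occurrence of the separator
head, sep, tail = line.rpartition(' ')

# splitlines breaks a string at line boundaries
text = 'first line\nsecond line\nthird line'
lines = text.splitlines()

# split with an explicit separator keeps empty fields
fields = 'a,b,,c'.split(',')

print(tail)    # dog
print(lines)   # ['first line', 'second line', 'third line']
print(fields)  # ['a', 'b', '', 'c']
```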

`format` Basics

The format method requires that the string have format fields indicated with curly braces.
The format fields are replaced with objects passed into the format method.

Example:

>>> '{} and {}'.format('Alice', 'Bob')
'Alice and Bob'

`format` Arguments

The format fields can be replaced by position or keyword

Position example:

>>> '{1} and {0}'.format('Alice', 'Bob')
'Bob and Alice'

Keyword example:

>>> '{a} and {b}'.format(a='Alice', b='Bob')
'Alice and Bob'

Mixed example:

>>> '{b} and {0}'.format('Alice', b='Bob')
'Bob and Alice'

Formatting Example with `format`

>>> for x in range(1,11):
...     print('{:2} {:3} {:4}'.format(x,x**2,x**3))
...
 1   1    1
 2   4    8
 3   9   27
 4  16   64
 5  25  125
 6  36  216
 7  49  343
 8  64  512
 9  81  729
10 100 1000

`format` Specifiers

An optional ’:’ and format specifier can follow the format field name
Format specifiers enable greater control over formatted values
Syntax (all are optional): {:[1][2][3][4][5][6][7][8]}
1. fill and alignment
2. sign
3. ’#’ – alternate number form
4. ’0’ – sign aware zero padding
5. width
6. grouping
7. ‘.’ followed by integer – precision
8. type
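
To see how the pieces combine in the order listed, here is a sketch that uses several options in one specifier. The values 1234.5 and 255 are arbitrary:

```python
# fill='*', align='>', sign='+', width=12, grouping=',', precision='.2', type='f'
money = '{:*>+12,.2f}'.format(1234.5)

# '#' adds the 0x prefix, '0' zero-pads, width=10, type='x'
hex_value = '{:#010x}'.format(255)

print(money)      # ***+1,234.50
print(hex_value)  # 0x000000ff
```

Note that the width counts every character in the result, including the sign, the separators, and the 0x prefix.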

`format` Specifiers: Width Option

The width option is a positive integer that specifies padding
If the width is smaller than the length of the value, the value is displayed in full (it is never truncated)

Example:

>>> '{:10}'.format('hello')
'hello     '
>>> '{:2}'.format('hello')
'hello'

`format` Specifiers: Fill and Alignment

Alignment options:
- ’<’: left align
- ’>’: right align
- ’=’: pad after sign (if any) but before digits
- ’^’: center align
A fill character can optionally be specified before the alignment character

Example:

>>> '{:^11}'.format('hello')
'   hello   '
>>> '{:>11}'.format('hello')
'      hello'
>>> '{:*>11}'.format('hello')
'******hello'
>>> '{:-^11}'.format('hello')
'---hello---'
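
The '=' alignment is not covered by the examples above. A minimal sketch, assuming numeric values (it is not allowed for strings):

```python
# '=' places the padding between the sign and the digits
padded = '{:=+8}'.format(42)
negative = '{:=8}'.format(-42)

# A fill character can be combined with '='
zero_filled = '{:0=+8}'.format(42)

# '=' alignment raises ValueError for strings
try:
    '{:=8}'.format('hello')
    string_allowed = True
except ValueError:
    string_allowed = False

print(repr(padded))       # '+     42'
print(repr(negative))     # '-     42'
print(repr(zero_filled))  # '+0000042'
print(string_allowed)     # False
```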

`format` Specifiers: Sign option

The sign option is only valid for number types
- ’+’: sign should be used for both positive and negative numbers
- ’-’: sign should be used for only negative numbers (default)
- ’ ’ (space): leading space for positive numbers and minus sign for negative numbers

Examples

>>> '{:+}'.format(123)
'+123'
>>> '{: }'.format(123)
' 123'
>>> '{:+}'.format(1.414)
'+1.414'

`format` Specifiers: `#`

The # option causes a type specific “alternate form” to be used for the conversion
The # can only be used for integer, float, complex, and Decimal types

Example

>>> # print 123 in binary
>>> '{:b}'.format(123)
'1111011'
>>> '{:#b}'.format(123)
'0b1111011'

`format` Specifiers: `0`

The 0 (preceding the width option) enables sign aware zero-padding for numeric types

Example

>>> '{:010}'.format(123)
'0000000123'
>>> '{:010}'.format(1.414)
'000001.414'

`format` Specifiers: Grouping Option

The grouping option specifies a character for separating thousands in numbers
The grouping option can be either ’_’ or ’,’

Example

>>> '{:_}'.format(1000000)
'1_000_000'
>>> '{:,}'.format(1000000)
'1,000,000'

`format` Specifiers: Precision Option

The precision option specifies how many digits are displayed:
- after the decimal point for fixed point floating point values
- in total, before and after the decimal point (significant digits), for general floating point values

Example

>>> import math
>>> x = math.sqrt(2)
>>> '{:.2}'.format(x)
'1.4'
>>> '{:.2f}'.format(x)
'1.41'
>>> '{:.6}'.format(x)
'1.41421'
>>> '{:.6f}'.format(x)
'1.414214'
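
Precision also interacts with other types. The sketch below shows three cases that the float examples do not cover: strings are truncated, large values in general format switch to exponent notation, and integers reject a precision. The input values are arbitrary:

```python
# For strings, precision is the maximum number of characters kept
truncated = '{:.3}'.format('hello')

# With 'g', 3 significant digits force exponent notation here
general = '{:.3g}'.format(1234.5678)

# Precision is not allowed for integers
try:
    '{:.2}'.format(123)
    int_precision_allowed = True
except ValueError:
    int_precision_allowed = False

print(truncated)              # hel
print(general)                # 1.23e+03
print(int_precision_allowed)  # False
```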

`format` Specifiers: Type Option

String option:
- ’s’: string
Common integer options:
- ’b’: binary
- ’c’: character
- ’d’: decimal
- ’o’: octal
- ’x’: hex (lower case)
- ’X’: hex (upper case)

`format` Specifiers: Type Option (Continued)

Common float options:
- ’e’: exponent notation
- ’E’: exponent notation (upper case E)
- ’f’: fixed point
- ’F’: fixed point (upper case NAN and INF)
- ’g’: general format
- ’%’: percent (multiply by 100)

`format` Specifiers: Type Option

Integer examples

>>> '{:d}'.format(123)
'123'
>>> '{:b}'.format(123)
'1111011'
>>> '{:X}'.format(123)
'7B'

Floating point examples

>>> '{:f}'.format(1.414)
'1.414000'
>>> '{:E}'.format(1.414)
'1.414000E+00'
>>> '{:%}'.format(1.414)
'141.400000%'
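
A few of the type options listed earlier but not demonstrated above, as a quick sketch with arbitrary values:

```python
# Integer types
char = '{:c}'.format(65)        # code point to character
octal = '{:o}'.format(8)
hex_lower = '{:x}'.format(255)

# Float types
exponent = '{:e}'.format(1234.5)
general_small = '{:g}'.format(0.00001234)  # small values use exponent notation
general_mid = '{:g}'.format(1234.5)        # moderate values use fixed notation

print(char)           # A
print(octal)          # 10
print(hex_lower)      # ff
print(exponent)       # 1.234500e+03
print(general_small)  # 1.234e-05
print(general_mid)    # 1234.5
```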

Pattern Matching with Regular Expressions

Regular expressions are a means of flexible pattern matching in strings
The Python interface to regular expressions is the re module
Some re functions:
- compile
- split
- match
- search
- sub
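
One distinction worth sketching is match versus search: match only succeeds at the beginning of the string, while search scans the whole string. The sample sentence is illustrative:

```python
import re

line = 'the quick brown fox'

# match anchors at the start of the string, so this finds nothing
at_start = re.match('quick', line)

# search scans forward until it finds the pattern
anywhere = re.search('quick', line)

# sub replaces every match; split breaks the string at every match
dashed = re.sub(r'\s+', '-', line)
words = re.split(r'\s+', line)

print(at_start)          # None
print(anywhere.start())  # 4
print(dashed)            # the-quick-brown-fox
print(words)             # ['the', 'quick', 'brown', 'fox']
```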

Pattern Matching with Regular Expressions

Example

>>> import re
>>> regex = re.compile(r'\s+')
>>> line = 'the quick brow fox jumped over a lazy dog'
>>> regex.split(line)
['the', 'quick', 'brow', 'fox', 'jumped', 'over', 'a', 'lazy', 'dog']

The pattern \s+ matches one or more consecutive whitespace characters

Pattern Matching with Regular Expressions

Example: comparing string methods

>>> line.index('fox')
15
>>> regex = re.compile('fox')
>>> match = regex.search(line)
>>> match.start()
15
>>> line.replace('fox', 'BEAR')
'the quick brow BEAR jumped over a lazy dog'
>>> regex.sub('BEAR', line)
'the quick brow BEAR jumped over a lazy dog'

Basic Regular Expression Syntax

Simple strings are matched exactly

>>> regex = re.compile('ion')
>>> regex.findall('Great Expectations')
['ion']

Some characters have special meanings:

. ^ $ * + ? { } [ ] \ | ( )
- These characters need to be escaped with a backslash (\) to match them
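
To illustrate why escaping matters, the sketch below searches for the literal text 1+1. Unescaped, the + is read as "one or more", so the pattern matches something else entirely. The sample text is made up; re.escape is the standard library helper that adds the backslashes for you:

```python
import re

text = 'is 1+1 equal to 11?'

# Unescaped: '1+1' means "one or more 1s followed by a 1"
unescaped = re.findall('1+1', text)

# Escaped by hand: the plus sign is matched literally
escaped = re.findall(r'1\+1', text)

# re.escape builds the escaped pattern automatically
auto_escaped = re.findall(re.escape('1+1'), text)

print(unescaped)     # ['11']
print(escaped)       # ['1+1']
print(auto_escaped)  # ['1+1']
```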

Regular Expression Escape Characters

Example: the ‘r’ prefix indicates a raw string

>>> regex = re.compile(r'\$')
>>> regex.findall('the cost is $20')
['$']

Example: strings versus raw strings

>>> print('a\tb\tc')
a    b    c
>>> print(r'a\tb\tc')
a\tb\tc

Matching Character Groups

The backslash can be used to give normal characters special meaning

character	description
`"\d"`	match any digit
`"\D"`	match any non-digit
`"\s"`	match any whitespace
`"\S"`	match any non-whitespace
`"\w"`	match any alphanumeric char or underscore
`"\W"`	match any char that is not alphanumeric or underscore

Example

>>> regex = re.compile(r'\w\s\w')
>>> regex.findall('the fox is 9 years old')
['e f', 'x i', 's 9', 's o']

Matching Custom Character Groups

The square brackets specify a set of characters

>>> regex = re.compile('[aeiou]')
>>> regex.split('consequential')
['c', 'ns', 'q', '', 'nt', '', 'l']

The dash can be used to specify a range

>>> regex = re.compile('[A-Z][0-9]')
>>> regex.findall('1043879, G2, H6')
['G2', 'H6']

Matching Repeated Characters

Curly braces with a number specify repetition

>>> regex = re.compile(r'\w{3}')
>>> regex.findall('The quick brown fox')
['The', 'qui', 'bro', 'fox']

Matching Repeated Characters

character	description
`?`	match zero or one
`*`	match zero or more
`+`	match one or more
`{n}`	match `n` repetitions
`{m,n}`	match between `m` and `n` repetitions
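
Each quantifier in the table can be tried out with findall. A short sketch with made-up test strings:

```python
import re

# ? : the 'u' is optional (zero or one)
optional = re.findall(r'colou?r', 'color colour colouur')

# * : zero or more b's after the a
zero_or_more = re.findall(r'ab*', 'a ab abbb')

# + : at least one b after the a
one_or_more = re.findall(r'ab+', 'a ab abbb')

# {m,n} : between 2 and 3 digits, as many as possible
ranged = re.findall(r'\d{2,3}', '1 22 333 4444')

print(optional)      # ['color', 'colour']
print(zero_or_more)  # ['a', 'ab', 'abbb']
print(one_or_more)   # ['ab', 'abbb']
print(ranged)        # ['22', '333', '444']
```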

Matching Repeated Characters

Example: email address matcher
```
email = re.compile(r'[\w.]+@\w+\.[a-z]{3}')
```
- "[\w.]+" one or more alphanumeric characters or periods
- "@" at sign
- "\w+" one or more alphanumeric characters
- "\." period
- "[a-z]{3}" exactly three lower case characters
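
A usage sketch for the matcher above. The addresses are hypothetical, and the last case shows one limitation of this simple pattern: with a subdomain, it stops at the first three lower case letters after the first period.

```python
import re

email = re.compile(r'[\w.]+@\w+\.[a-z]{3}')

# Hypothetical sample text containing two addresses
text = 'Contact alice.smith@example.com or bob@school.edu for details'
found = email.findall(text)

# Limitation: a subdomain causes a truncated match
partial = email.findall('user@mail.example.org')

print(found)    # ['alice.smith@example.com', 'bob@school.edu']
print(partial)  # ['user@mail.exa']
```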
